Abstract — The performance of estimating the common support 
for jointly sparse signals based on their projections onto lower- 
dimensional space is analyzed. Support recovery is formulated 
as a multiple-hypothesis testing problem. Both upper and lower 
bounds on the probability of error are derived for general 
measurement matrices, by using the Chernoff bound and Fano's 
inequality, respectively. The upper bound shows that the perfor- 
mance is determined by a quantity measuring the measurement 
matrix incoherence, while the lower bound reveals the importance 
of the total measurement gain. The lower bound is applied 
to derive the minimal number of samples needed for accurate 
direction-of-arrival (DOA) estimation for a sparse representation 
based algorithm. When applied to Gaussian measurement ensembles,
these bounds give necessary and sufficient conditions for a
vanishing probability of error for the majority of realizations of the
measurement matrix. Our results offer surprising insights into 
sparse signal recovery. For example, as far as support recovery 
is concerned, the well-known bound in Compressive Sensing 
with the Gaussian measurement matrix is generally not sufficient 
unless the noise level is low. Our study provides an alternative 
performance measure, one that is natural and important in 
practice, for signal recovery in Compressive Sensing and other 
application areas exploiting signal sparsity. 

Index Terms — Chernoff bound, Compressive Sensing, Fano's 
inequality, jointly sparse signals, multiple hypothesis testing, 
probability of error, support recovery 

I. Introduction 

SUPPORT recovery for jointly sparse signals concerns 
accurately estimating the non-zero component locations 
shared by a set of sparse signals based on a limited number 
of noisy linear observations. More specifically, suppose that 
$\{x(t) \in \mathbb{F}^N,\ t = 1, 2, \ldots, T\}$, $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$, is a sequence
of jointly sparse signals (possibly under a sparsity-inducing
basis $\Phi$ instead of the canonical domain) with a common
support S, which is the index set indicating the non-vanishing 
signal coordinates. This model is the same as the joint sparsity 
model 2 (JSM-2) in [1]. The observation model is linear: 

$y(t) = A x(t) + w(t), \quad t = 1, 2, \ldots, T.$ (1)

In (1), $A \in \mathbb{F}^{M \times N}$ is the measurement matrix, $y(t) \in \mathbb{F}^M$
the noisy data vector, and $w(t) \in \mathbb{F}^M$ an additive noise.
In most cases, the sparsity level $K = |S|$ and the number
of observations $M$ are far less than $N$, the dimension of
the ambient space. This problem arises naturally in several 
signal processing areas such as Compressive Sensing [2]-[6], 
source localization [7]-[10], sparse approximation and signal 
denoising [11]. 

Compressive Sensing [2]-[4], a recently developed field 
exploiting the sparsity property of most natural signals, shows 
great promise to reduce signal sampling rate. In the classical 
setting of Compressive Sensing, only one snapshot is considered;
i.e., $T = 1$ in (1). The goal is to recover a long vector
$x := x(1)$ with a small fraction of non-zero coordinates from
the much shorter observation vector $y := y(1)$. Since most
natural signals are compressible under some basis and are
well approximated by their $K$-sparse representations [12],
this scheme, if properly justified, will reduce the necessary
sampling rate beyond the limit set by Nyquist and Shannon
[5], [6]. Surprisingly, for exact $K$-sparse signals, if
$M = O(K \log(N/K)) \ll N$ and the measurement matrix is generated
randomly from, for example, a Gaussian distribution, we
can recover $x$ exactly in the noise-free setting by solving
a linear programming task. Besides, various methods have
been designed for the noisy case [13]-[17]. Along with these
algorithms, rigorous theoretic analysis is provided to guarantee
their effectiveness in terms of, for example, various $\ell_p$-norms
of the estimation error for $x$ [13]-[17]. However, these results
offer no guarantee that we can recover the support of a sparse 
signal correctly. 

The accurate recovery of signal support is crucial to Com- 
pressive Sensing both in theory and in practice. Since for sig- 
nal recovery it is necessary to have K < M, signal component 
values can be computed by solving a least squares problem 
once its support is obtained. Therefore, support recovery is a 
stronger theoretic criterion than various $\ell_p$-norms. In practice,
the success of Compressive Sensing in a variety of applications 
relies on its ability for correct support recovery because the 
non-zero component indices usually have significant physical 
meanings. The support of temporally or spatially sparse signals 
reveals the timing or location for important events such as 
anomalies. The indices for non-zero coordinates in the Fourier 
domain indicate the harmonics existing in a signal [18], which 
is critical for tasks such as spectrum sensing for cognitive 
radios [19]. In compressed DNA microarrays for bio-sensing, 
the existence of certain target agents in the tested solution is 
reflected by the locations of non-vanishing coordinates, while 
the magnitudes are determined by their concentrations [20]- 
[23]. For compressive radar imaging, the sparsity constraints 
are usually imposed on the discretized time-frequency do- 
main. The distance and velocity of an object have a direct 
correspondence to its coordinate in the time-frequency domain. 
The magnitude determined by coefficients of reflection is of 
less physical significance [24]-[26]. In sparse linear regression 
[27], the recovered parameter support corresponds to the few 
factors that explain the data. In all these applications, the 
support is physically more significant than the component 
values. 
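
The least squares step mentioned above (computing the component values once the support is known) can be sketched as follows. This is a minimal illustration in Python with NumPy and is not part of the original analysis: the sizes, the Gaussian matrix, and the chosen support are arbitrary assumptions, and the measurements are taken noise-free so that the recovery is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 8, 20, 3                      # hypothetical sizes with K < M << N
A = rng.standard_normal((M, N))         # illustrative Gaussian measurement matrix
support = np.array([2, 7, 15])          # assumed known (or already recovered) support

x = np.zeros(N)
x[support] = rng.standard_normal(K)     # K-sparse signal
y = A @ x                               # noise-free measurements

# Restrict A to the support columns and solve the K-dimensional least squares problem.
coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
x_hat = np.zeros(N)
x_hat[support] = coef
```

With noise, the same call returns the least squares estimate on the support rather than the exact values.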

Our study of sparse support recovery is also motivated by 
the recent reformulation of the source localization problem as 
one of sparse spectrum estimation. In [7], the authors trans- 
form the process of source localization using sensory arrays 
into the task of estimating the spectrum of a sparse signal 
by discretizing the parameter manifold. This method exhibits 
super-resolution in the estimation of direction of arrival (DOA) 
compared with traditional techniques such as beamforming 
[28], Capon [29], and MUSIC [30], [31]. Since the basic model 
employed in [7] applies to several other important problems 
in signal processing (see [32] and references therein), the 
principle is readily applicable to those cases. This idea is later 
generalized and extended to other source localization settings 
in [8]-[10]. For source localization, the support of the sparse
signal reveals the DOA of sources. Therefore, the recovery
algorithm's ability to recover the support exactly is key to the
effectiveness of the method. We also note that usually multiple
temporal snapshots are collected, which results in a jointly
sparse signal set as in (1). In addition, since M is the number
of sensors while T is the number of temporal samples, it is far 
more expensive to increase M than T. The same comments 
apply to several other examples in the Compressive Sensing 
applications discussed in the previous paragraph, especially the 
compressed DNA microarrays, spectrum sensing for cognitive 
radios, and Compressive Sensing radar imaging. 

The signal recovery problem with joint sparsity con- 
straint [33]-[36], also termed the multiple measurement vector 
(MMV) problem [37]-[41], has been considered in a line of 
previous works. Several algorithms, among them Simultaneous 
Orthogonal Matching Pursuit (SOMP) [34], [37], [40]; convex 
relaxation [41]; $\ell_1$-minimization [38], [39]; and M-FOCUSS
[37], are proposed and analyzed, either numerically or theoret- 
ically. These algorithms are multiple-dimension extensions of 
their one-dimension counterparts. Most performance measures 
of the algorithms are concerned with bounds on various norms 
of the difference between the true signals and their estimates or 
their closely related variants. The performance bounds usually 
involve the mutual coherence between the measurement matrix 
A and the basis matrix $\Phi$ under which the measured signals
x(t) have a jointly sparse representation. However, with joint
sparsity constraints, a natural measure of performance would
be the model (1)'s potential for correctly identifying the true
common support, and hence the algorithm's ability to achieve
this potential. As part of their research, J. Chen and X. Huo
derived, in a noiseless setting, sufficient conditions on the
uniqueness of solutions to (1) under $\ell_0$ and $\ell_1$ minimization.
In [37], S. Cotter et al. numerically compared the probabilities
of correctly identifying the common support by basic
matching pursuit, orthogonal matching pursuit, FOCUSS, and 
regularized FOCUSS in the multiple-measurement setting with 
a range of SNRs and different numbers of snapshots. 

The availability of multiple temporal samples offers several
advantages over the single-sample case. As suggested by the
upper bound (26) on the probability of error, increasing the
number of temporal samples drives the probability of error
to zero exponentially fast as long as a certain condition on the
inconsistency property of the measurement matrix is satisfied.
In [42], by contrast, the probability of error is driven to zero
by scaling the SNR according to the signal dimension, which is
less natural than increasing the number of samples. Our
results also show that under some conditions increasing temporal
samples is usually equivalent to increasing the number
of observations for a single snapshot. The latter is generally
much more expensive in practice. In addition, when there is
considerable noise and the columns of the measurement matrix
are normalized to one, it is necessary to have multiple temporal
samples for accurate support recovery, as discussed in Section IV
and Section V.

Our work has several major differences compared to related 
work [43] and [42], which also analyze the performance 
bounds on the probability of error for support recovery using 
information theoretic tools. The first difference is in the way 
the problem is modeled: In [42], [43], the sparse signal 
is deterministic with known smallest absolute value of the 
non-zero components while we consider a random signal 
model. This leads to the second difference: We define the 
probability of error over the signal and noise distributions with 
the measurement matrix fixed; in [42], [43], the probability
of error is taken over the noise, the Gaussian measurement
matrix and the signal support. Most of the conclusions in
this paper apply to general measurement matrices and we
only restrict ourselves to the Gaussian measurement matrix
in Section V. Therefore, although we use a similar set of
theoretical tools, the exact details of applying them are quite
different. In addition, we consider a multiple measurement
model while only one temporal sample is available in [42],
[43]. In particular, to get a vanishing probability of error,
Aeron et al. [42] need to scale the SNR according to the
signal dimension, which has a similar effect to having multiple
temporal measurements in our paper. Although the first two
differences make it difficult to compare corresponding results
in these two papers, we will make some heuristic comments
in Section V.

The contribution of our work is threefold. First, we intro- 
duce a hypothesis-testing framework to study the performance 
for multiple support recovery. We employ well-known tools in 
statistics and information theory such as the Chernoff bound 
and Fano's inequality to derive both upper and lower bounds 
on the probability of error. The upper bound we derive is for 
the optimal decision rule, in contrast to performance analysis 
for specific sub-optimal reconstruction algorithms [13]-[17]. 
Hence, the bound can be viewed as a measure of the measure- 
ment system's ability to correctly identify the true support. Our 
bounds isolate important quantities that are crucial for system 
performance. Since our analysis is based on measurement 
matrices with as few assumptions as possible, the results can 
be used as guidance in system design. Second, we apply 
these performance bounds to other more specific situations 
and derive necessary and sufficient conditions in terms of 
the system parameters to guarantee a vanishing probability of 
error. In particular, we study necessary conditions for accurate 
source localization by the mechanism proposed in [7]. By 
restricting our attention to Gaussian measurement matrices, 
we derive a result parallel to those for classical Compressive 
Sensing [2], [3], namely, the number of measurements that 
are sufficient for signal reconstruction. Even if we adopt the 
probability of error as the performance criterion, we get the 
same bound on M as in [2], [3]. However, our result suggests 
that generally it is impossible to obtain the true support 
accurately with only one snapshot when there is considerable 
noise. We also obtain a necessary condition showing that the 
log term cannot be dropped in Compressive Sensing. Last 
but not least, in the course of studying the performance bounds 
we explore the eigenvalue structure of a fundamental matrix 
in support recovery hypothesis testing for both general mea- 
surement matrices and the Gaussian measurement ensemble. 
These results are of independent interest. 

The paper is organized as follows. In Section II, we
introduce the mathematical model and briefly review the
fundamental ideas in hypothesis testing. Section III is devoted
to the derivation of upper bounds on the probability of error
for general measurement matrices. We first derive an upper
bound on the probability of error for the binary support
recovery problem by employing the well-known Chernoff
bound in detection theory [44] and extend it to multiple
support recovery. We also study the effect of noise on system
performance. In Section IV, an information theoretic lower
bound is given by using Fano's inequality, and a necessary
condition is shown for the DOA problem considered in [7].
We focus on the Gaussian ensemble in Section V. Necessary
and sufficient conditions on system parameters for accurate
support recovery are given and their implications discussed.
The paper is concluded in Section VI.

II. Notations, Models, and Preliminaries 
A. Notations 

We first introduce some notations used throughout this 
paper. Suppose $x \in \mathbb{F}^N$ is a column vector. We denote by
$S = \mathrm{supp}(x) \subseteq \{1, \ldots, N\}$ the support of $x$, which is
defined as the set of indices corresponding to the non-zero
components of $x$. For a matrix $X$, $S = \mathrm{supp}(X)$ denotes the
index set of non-zero rows of $X$. Here the underlying field $\mathbb{F}$
can be assumed as $\mathbb{R}$ or $\mathbb{C}$. We consider both real and complex
cases simultaneously. For this purpose, we denote a constant
$\kappa = 1/2$ or $1$ for the real or complex case, respectively.

Suppose $S$ is an index set. We denote by $|S|$ the number of
elements in $S$. For any column vector $x \in \mathbb{F}^N$, $x_S$ is
the vector in $\mathbb{F}^{|S|}$ formed by the components of $x$ indicated by
the index set $S$; for any matrix $B$, $B^S$ denotes the submatrix
formed by picking the rows of $B$ corresponding to indices in
$S$, while $B_S$ is the submatrix with columns from $B$ indicated
by $S$. If $I$ and $J$ are two index sets, then $B^I_J = (B^I)_J$,
the submatrix of $B$ with rows indicated by $I$ and columns
indicated by $J$.
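
As an illustration only, the index-set notation above maps onto NumPy indexing as sketched below. The matrix, the vector, and the index sets are arbitrary examples, and the indices are 0-based here whereas the text counts from 1.

```python
import numpy as np

B = np.arange(20).reshape(4, 5)   # a 4 x 5 example matrix
x = B[0]                          # first row reused as a vector of length 5
S = [1, 3]                        # an index set (0-based)
I, J = [0, 2], [1, 4]             # row index set I and column index set J

x_S = x[S]                        # components of x indicated by S
B_rows = B[S, :]                  # rows of B indicated by S (superscript S)
B_cols = B[:, S]                  # columns of B indicated by S (subscript S)
B_IJ = B[np.ix_(I, J)]            # rows I and columns J in one step
```

Selecting rows first and columns second, `B[I, :][:, J]`, gives the same submatrix as `B[np.ix_(I, J)]`.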

Transpose of a vector or matrix is denoted by $'$ while
conjugate transpose by $\dagger$. $A \otimes B$ represents the Kronecker
product of two matrices. For a vector $v$, $\mathrm{diag}(v)$ is the diagonal
matrix with the elements of $v$ in the diagonal. The identity
matrix of dimension $M$ is $I_M$. The trace of matrix $A$ is given
by $\mathrm{tr}(A)$, the determinant by $|A|$, and the rank by $\mathrm{rank}(A)$.
Though the notation for determinant is inconsistent with that 
for cardinality of an index set, the exact meaning can always 
be understood from the context. 



Bold symbols are reserved for random vectors and matrices. 
We use P to denote the probability of an event and E the 
expectation. The underlying probability space can be inferred 
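from the context. Gaussian distribution for a random vector in
field $\mathbb{F}$ with mean $\mu$ and covariance matrix $\Sigma$ is represented
by $\mathbb{F}\mathcal{N}(\mu, \Sigma)$. Matrix variate Gaussian distribution [45] for
$Y \in \mathbb{F}^{M \times T}$ with mean $\Theta \in \mathbb{F}^{M \times T}$ and covariance matrix
$\Sigma \otimes \Psi$, where $\Sigma \in \mathbb{F}^{M \times M}$ and $\Psi \in \mathbb{F}^{T \times T}$, is denoted by
$\mathbb{F}\mathcal{N}_{M,T}(\Theta, \Sigma \otimes \Psi)$.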

Suppose $\{f_n\}_{n=1}^{\infty}$ and $\{g_n\}_{n=1}^{\infty}$ are two positive sequences;
$f_n = o(g_n)$ means that $\lim_{n \to \infty} f_n / g_n = 0$. An alternative
notation in this case is $g_n \gg f_n$. We use $f_n = O(g_n)$ to denote
that there exist an $N \in \mathbb{N}$ and $C > 0$ independent of $n$ such
that $f_n \le C g_n$ for $n > N$. Similarly, $f_n = \Omega(g_n)$ means
$f_n \ge C g_n$ for $n > N$. These simple but expedient notations
introduced by G. H. Hardy greatly simplify derivations [46].

B. Models 

Next, we introduce our mathematical model. Suppose 
$x(t) \in \mathbb{F}^N$, $t = 1, \ldots, T$, are jointly sparse signals with
common support; that is, only a few components of $x(t)$
are non-zero and the indices corresponding to these non-zero
components are the same for all $t = 1, \ldots, T$. The common
support $S = \mathrm{supp}(x(t))$ has known size $K = |S|$. We assume
that the vectors $x_S(t)$, $t = 1, \ldots, T$, formed by the non-zero
components of $x(t)$ follow i.i.d. $\mathbb{F}\mathcal{N}(0, I_K)$. The measurement
model is as follows:

$y(t) = A x(t) + w(t), \quad t = 1, 2, \ldots, T,$ (2)

where $A$ is the measurement matrix and $y(t) \in \mathbb{F}^M$ the
measurements. The additive noise $w(t) \in \mathbb{F}^M$ is assumed to
follow i.i.d. $\mathbb{F}\mathcal{N}(0, \sigma^2 I_M)$. Note that assuming unit variance
for signals loses no generality since only the ratio of signal
variance to noise variance appears in all subsequent analyses.
In this sense, we view $1/\sigma^2$ as the signal-to-noise ratio (SNR).

Let $X = [x(1)\ x(2)\ \cdots\ x(T)]$ and $Y$, $W$ be defined
in a similar manner. Then we write the model in the
more compact matrix form:

$Y = AX + W.$ (3)

We start our analysis for general measurement matrix $A$.
For an arbitrary measurement matrix $A \in \mathbb{F}^{M \times N}$, if every
$M \times M$ submatrix of $A$ is non-singular, we then call $A$ a
non-degenerate measurement matrix. In this case, the corresponding
linear system $Ax = b$ is said to have the Unique
Representation Property (URP), the implication of which is
discussed in [13]. While most of our results apply to general
non-degenerate measurement matrices, we need to impose
more structure on the measurement matrices in order to obtain
more profound results. In particular, we will consider Gaussian
measurement matrix $A$ whose elements $A_{mn}$ are generated
from i.i.d. $\mathbb{F}\mathcal{N}(0, 1)$. However, since our performance analysis
is carried out by conditioning on a particular realization of $A$,
we still use non-bold $A$ except in Section V. The role played
by the variance of $A_{mn}$ is indistinguishable from that of a
signal variance and hence can be combined to $1/\sigma^2$, the SNR,
by the note in the previous paragraph.
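
The measurement model (3) can be simulated with a few lines of NumPy. The sketch below covers only the real case ($\mathbb{F} = \mathbb{R}$); the function name `simulate`, the dimensions, and the noise level are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def simulate(A, support, T, sigma2, rng):
    """Draw Y = A X + W for the real case of model (3)."""
    M, N = A.shape
    X = np.zeros((N, T))
    # Non-zero rows: x_S(t) i.i.d. N(0, I_K), one column per snapshot.
    X[support, :] = rng.standard_normal((len(support), T))
    # Additive noise: w(t) i.i.d. N(0, sigma2 * I_M).
    W = np.sqrt(sigma2) * rng.standard_normal((M, T))
    return A @ X + W, X

rng = np.random.default_rng(1)
M, N, K, T = 6, 15, 3, 50                      # hypothetical sizes
A = rng.standard_normal((M, N))                # one realization of a Gaussian matrix
support = np.sort(rng.choice(N, size=K, replace=False))
Y, X = simulate(A, support, T, sigma2=0.1, rng=rng)

# supp(X): the index set of non-zero rows of X.
recovered = np.flatnonzero(np.any(X != 0, axis=1))
```

Here `recovered` is read off the noiseless signal matrix `X`; the support recovery problem studied in this paper is to infer the same index set from `Y` alone.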

We now consider two hypothesis-testing problems. The first
one is a binary support recovery problem:

$H_0 : \mathrm{supp}(X) = S_0$
$H_1 : \mathrm{supp}(X) = S_1$ (4)

The results we obtain for binary support recovery (4)
offer insight into our second problem: the multiple support
recovery. In the multiple support recovery problem we choose
one among $L$ distinct candidate supports of $X$, which is a
multiple-hypothesis testing problem:

$H_0 : \mathrm{supp}(X) = S_0$
$H_1 : \mathrm{supp}(X) = S_1$
$\vdots$ (5)
$H_{L-1} : \mathrm{supp}(X) = S_{L-1}$

C. Preliminaries for Hypothesis Testing 

We now briefly introduce the fundamentals of hypothesis 
testing. The following discussion is based mainly on [44]. In 
a simple binary hypothesis test, the goal is to determine which 
of two candidate distributions is the true one that generates the 
data matrix (or vector) Y: 

$H_0 : Y \sim p(Y|H_0)$
$H_1 : Y \sim p(Y|H_1)$ (6)

There are two types of errors when one makes a choice
based on the observed data $Y$. A false alarm corresponds
to choosing $H_1$ when $H_0$ is true, while a miss happens by
choosing $H_0$ when $H_1$ is true. The probabilities of these two
types of errors are called the probability of a false alarm and
the probability of a miss, which are denoted by

$P_F = P(\text{Choose } H_1 | H_0),$ (7)
$P_M = P(\text{Choose } H_0 | H_1),$ (8)

respectively. Depending on whether one knows the prior
probabilities $P(H_0)$ and $P(H_1)$ and assigns losses to errors,
different criteria can be employed to derive the optimal decision
rule. In this paper we adopt the probability of error
with equal prior probabilities of $H_0$ and $H_1$ as the decision
criterion; that is, we try to find the optimal decision rule by
minimizing

$P_{err} = P_F P(H_0) + P_M P(H_1) = \frac{1}{2} P_F + \frac{1}{2} P_M.$ (9)

The optimal decision rule is then given by the likelihood ratio
test:

$\ell(Y) = \log \frac{p(Y|H_1)}{p(Y|H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} 0,$ (10)

where $\log(\cdot)$ is the natural logarithm function.

The probability of error associated with the optimal decision
rule, namely, the likelihood ratio test (10), is a measure of
the best performance a system can achieve. In many cases
of interest, the simple binary hypothesis testing problem (6)
is derived from a signal-generation system. For example, in
a digital communication system, hypotheses $H_0$ and $H_1$ correspond
to the transmitter sending digit 0 and 1, respectively,
and the distributions of the observed data under the hypotheses
are determined by the modulation method of the system.
Therefore, the minimal probability of error achieved by the
likelihood ratio test is a measure of the performance of the
modulation method. For the problem addressed in this paper,
the minimal probability of error reflects the measurement
matrix's ability to distinguish different signal supports.

The Chernoff bound [44] is a well-known tight upper bound
on the probability of error. In many cases, the optimum test
can be derived and implemented efficiently but an exact performance
calculation is impossible. Even if such an expression
can be derived, it is often too complicated to be of practical use.
For this reason, a simple bound often turns out to be
more useful in problems of practical importance. The
Chernoff bound, based on the moment generating function of
the test statistic $\ell(Y)$ in (10), provides an easy way to compute
such a bound.

Define $\mu(s)$ as the logarithm of the moment generating
function of $\ell(Y)$:

$\mu(s) \triangleq \log \int_{-\infty}^{\infty} e^{s \ell(Y)} p(Y|H_0) \, dY$
$\quad = \log \int_{-\infty}^{\infty} [p(Y|H_1)]^{s} [p(Y|H_0)]^{1-s} \, dY.$ (11)

Then the Chernoff bound states that

$P_F \le \exp[\mu(s_m)] \le \exp[\mu(s)],$ (12)
$P_M \le \exp[\mu(s_m)] \le \exp[\mu(s)],$ (13)

and

$P_{err} \le \frac{1}{2} \exp[\mu(s_m)] \le \frac{1}{2} \exp[\mu(s)],$ (14)

where $0 \le s \le 1$ and $s_m = \arg\min_{0 \le s \le 1} \mu(s)$. Note that a
refined argument gives the constant $1/2$ in (14) instead of 1 as
obtained by direct application of (12) and (13) [44]. We use
these bounds to study the performance of the support recovery
problem.
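
To make the bound (14) concrete, the following sketch evaluates it for a deliberately simple scalar test that is not taken from the paper: $H_0 : y \sim \mathcal{N}(0, v_0)$ versus $H_1 : y \sim \mathcal{N}(0, v_1)$ with $v_1 > v_0$. For this toy problem both $\mu(s)$ and the exact error of the likelihood ratio test have closed forms, so the two can be compared. The variances and the grid search over $s$ are illustrative assumptions.

```python
import math

def mu(s, v0, v1):
    """Log-MGF mu(s) of (11) for H0: y ~ N(0, v0) versus H1: y ~ N(0, v1)."""
    return -0.5 * math.log((s / v1 + (1 - s) / v0) * v1**s * v0**(1 - s))

def exact_error(v0, v1):
    """Exact error probability of the likelihood ratio test, equal priors, v1 > v0."""
    tau = math.log(v1 / v0) / (1 / v0 - 1 / v1)           # decide H1 when y^2 > tau
    p_false_alarm = math.erfc(math.sqrt(tau / (2 * v0)))  # P(y^2 > tau | H0)
    p_miss = math.erf(math.sqrt(tau / (2 * v1)))          # P(y^2 < tau | H1)
    return 0.5 * (p_false_alarm + p_miss)

v0, v1 = 1.0, 4.0                                   # illustrative variances
grid = [i / 1000 for i in range(1001)]              # crude search for s_m on [0, 1]
s_m = min(grid, key=lambda s: mu(s, v0, v1))
chernoff = 0.5 * math.exp(mu(s_m, v0, v1))          # right-hand side of (14) at s_m
exact = exact_error(v0, v1)
```

The value `chernoff` is never smaller than `exact`; how close the two are depends on the chosen variances.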

We next extend to multiple-hypothesis testing the key elements
of the binary hypothesis testing. The goal in a simple
multiple-hypothesis testing problem is to make a choice among
$L$ distributions based on the observations:

$H_0 : Y \sim p(Y|H_0)$
$H_1 : Y \sim p(Y|H_1)$
$\vdots$ (15)
$H_{L-1} : Y \sim p(Y|H_{L-1})$

Using the total probability of error as a decision criterion
and assuming equal prior probabilities for all hypotheses, we
obtain the optimal decision rule given by

$H^* = \arg\max_{0 \le l \le L-1} p(Y|H_l).$ (16)

Application of the union bound and the Chernoff bound (14)
shows that the total probability of error is bounded as follows:

$P_{err} = \sum_{i=0}^{L-1} P(H^* \ne H_i | H_i) P(H_i)$
$\quad \le \frac{1}{2L} \sum_{i=0}^{L-1} \sum_{j=0, j \ne i}^{L-1} \exp[\mu(s; H_i, H_j)], \quad 0 \le s \le 1,$ (17)

where $\exp[\mu(s; H_i, H_j)]$ is the moment-generating function in
the binary hypothesis testing problem for $H_i$ and $H_j$. Hence,
we obtain an upper bound for multiple-hypothesis testing from
that for binary hypothesis testing.

III. Upper Bound on Probability of Error for 
Non-degenerate Measurement Matrices 

In this section, we apply the general theory for hypothesis
testing, the Chernoff bound on the probability of error in
particular, to the support recovery problems (4) and (5). We
first study binary support recovery, which lays the foundation
for the general support recovery problem.

A. Binary Support Recovery 

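Under model (3) and the assumptions pertaining to it,
observations $Y$ follow a matrix variate Gaussian distribution
[45] when the true support is $S$:

$Y | S \sim \mathbb{F}\mathcal{N}_{M,T}(0, \Sigma_S \otimes I_T)$ (18)

with the probability density function (pdf) given by

$p(Y|S) = \frac{1}{(\pi/\kappa)^{\kappa M T} |\Sigma_S|^{\kappa T}} \exp\left[ -\kappa \, \mathrm{tr}\left( Y^\dagger \Sigma_S^{-1} Y \right) \right],$ (19)

where $\Sigma_S = A_S A_S^\dagger + \sigma^2 I_M$ is the common covariance matrix
for each column of $Y$. The binary support recovery problem
(4) is equivalent to a linear Gaussian binary hypothesis testing
problem:

$H_0 : Y \sim \mathbb{F}\mathcal{N}_{M,T}(0, \Sigma_{S_0} \otimes I_T)$
$H_1 : Y \sim \mathbb{F}\mathcal{N}_{M,T}(0, \Sigma_{S_1} \otimes I_T)$ (20)
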

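From now on, for notation simplicity we will denote $\Sigma_{S_i}$ by
$\Sigma_i$. The optimal decision rule with minimal probability of error
given by the likelihood ratio test $\ell(Y)$ in (10) reduces to

$-\kappa \, \mathrm{tr}\left[ Y^\dagger \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) Y \right] - \kappa T \log \frac{|\Sigma_1|}{|\Sigma_0|} \underset{H_0}{\overset{H_1}{\gtrless}} 0.$ (21)
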

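To analyze the performance of the likelihood ratio test (21),
we first compute the log-moment-generating function of $\ell(Y)$
according to (11):

$\mu(s) = \log \int [p(Y|H_1)]^{s} [p(Y|H_0)]^{1-s} \, dY$
$\quad = \log \int \frac{1}{(\pi/\kappa)^{\kappa M T} |\Sigma_1|^{\kappa s T} |\Sigma_0|^{\kappa (1-s) T}} \exp\left\{ -\kappa \, \mathrm{tr}\left[ Y^\dagger \left( s \Sigma_1^{-1} + (1-s) \Sigma_0^{-1} \right) Y \right] \right\} dY$ (22)
$\quad = \log \frac{\left| s \Sigma_1^{-1} + (1-s) \Sigma_0^{-1} \right|^{-\kappa T}}{|\Sigma_1|^{\kappa s T} |\Sigma_0|^{\kappa (1-s) T}}$
$\quad = -\kappa T \log \left| s H^{1-s} + (1-s) H^{-s} \right|, \quad 0 \le s \le 1,$ (23)

where $H = \Sigma_0^{1/2} \Sigma_1^{-1} \Sigma_0^{1/2}$. The computation of the exact
minimizer $s_m = \arg\min_{0 \le s \le 1} \mu(s)$ is non-trivial and will lead
to an expression of $\mu(s_m)$ too complicated to handle. When
$|S_0| = |S_1|$ and the columns of $A$ are not highly correlated,
for example in the case of $A$ with i.i.d. elements, $s_m \approx 1/2$.
We then take $s = 1/2$ in the Chernoff bounds (12), (13), and
(14). Whereas the bounds obtained in this way may not be the
absolute best ones, they are still valid.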
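
The closed form (23) can be checked numerically. The sketch below, for the real case ($\kappa = 1/2$), evaluates $\mu(s)$ in two ways: from the determinants of $s \Sigma_1^{-1} + (1-s) \Sigma_0^{-1}$, $\Sigma_1$ and $\Sigma_0$, and from the eigenvalues of $H$ as in (23). The helper names, the random matrix, the supports, and the noise level are assumptions made for illustration; the eigenvalues of $H$ are obtained from a similar symmetric matrix built with a Cholesky factor of $\Sigma_1$.

```python
import numpy as np

def cov(A, support, sigma2):
    """Sigma_S = A_S A_S' + sigma2 * I_M (real case)."""
    A_S = A[:, support]
    return A_S @ A_S.T + sigma2 * np.eye(A.shape[0])

def logdet(Z):
    return np.linalg.slogdet(Z)[1]

def mu_direct(s, Sig0, Sig1, T, kappa=0.5):
    """mu(s) from log-determinants of the covariance matrices."""
    mix = s * np.linalg.inv(Sig1) + (1 - s) * np.linalg.inv(Sig0)
    return -kappa * T * (logdet(mix) + s * logdet(Sig1) + (1 - s) * logdet(Sig0))

def eig_H(Sig0, Sig1):
    """Eigenvalues of H = Sig0^{1/2} Sig1^{-1} Sig0^{1/2}."""
    C = np.linalg.cholesky(Sig1)                         # Sig1 = C C'
    Z = np.linalg.solve(C, np.linalg.solve(C, Sig0).T)   # C^{-1} Sig0 C^{-T}, similar to H
    return np.linalg.eigvalsh((Z + Z.T) / 2)

def mu_eig(s, Sig0, Sig1, T, kappa=0.5):
    """mu(s) from (23): -kappa T sum_j log(s h_j^{1-s} + (1-s) h_j^{-s})."""
    h = eig_H(Sig0, Sig1)
    return -kappa * T * np.sum(np.log(s * h**(1 - s) + (1 - s) * h**(-s)))

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 12))                         # hypothetical 8 x 12 matrix
Sig0, Sig1 = cov(A, [0, 1, 2], 0.5), cov(A, [2, 3, 4], 0.5)
s_grid = np.linspace(0.0, 1.0, 201)
values = np.array([mu_eig(s, Sig0, Sig1, T=1) for s in s_grid])
s_best = s_grid[np.argmin(values)]                       # grid estimate of s_m
```

Inspecting `s_best` for different realizations gives a rough sense of how close $s_m$ is to $1/2$ in a given setup; the sketch does not establish that approximation in general.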

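As positive definite Hermitian matrices, $H^{1/2}$ and $H^{-1/2}$ can
be simultaneously diagonalized by a unitary transformation.
Suppose that the eigenvalues of $H$ are
$\lambda_1 \ge \cdots \ge \lambda_{k_0} > 1 = \cdots = 1 > \sigma_1 \ge \cdots \ge \sigma_{k_1}$
and $D = \mathrm{diag}[\lambda_1, \ldots, \lambda_{k_0}, 1, \ldots, 1, \sigma_1, \ldots, \sigma_{k_1}]$.
Then it is easy to show that

$\mu(1/2) = -\kappa T \left[ \sum_{j=1}^{k_0} \log \frac{\sqrt{\lambda_j} + 1/\sqrt{\lambda_j}}{2} + \sum_{j=1}^{k_1} \log \frac{\sqrt{\sigma_j} + 1/\sqrt{\sigma_j}}{2} \right].$ (24)
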



Therefore, it is necessary to count the numbers of eigenvalues
of $H$ that are greater than 1, equal to 1 and less than
1, i.e., the values of $k_0$ and $k_1$ for general non-degenerate
measurement matrix $A$. We have the following proposition on
the eigenvalue structure of $H$:

Proposition 1 For any non-degenerate measurement matrix
$A$, let $H = \Sigma_0^{1/2} \Sigma_1^{-1} \Sigma_0^{1/2}$, $k_i = |S_0 \cap S_1|$, $k_0 = |S_0 \setminus S_1| =
|S_0| - k_i$, $k_1 = |S_1 \setminus S_0| = |S_1| - k_i$; and assume $M \ge k_0 + k_1$;
then $k_0$ eigenvalues of matrix $H$ are greater than 1, $k_1$ less
than 1, and $M - (k_0 + k_1)$ equal to 1.

Proof: See Appendix A. 
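
The eigenvalue counts in Proposition 1 are easy to check numerically for one realization. The following sketch (real case) uses an arbitrary Gaussian matrix and two supports of different sizes; the function name `eigs_of_H`, the dimensions, and the tolerance used to decide whether an eigenvalue equals 1 are illustrative assumptions.

```python
import numpy as np

def eigs_of_H(A, S0, S1, sigma2):
    """Eigenvalues of H = Sigma_0^{1/2} Sigma_1^{-1} Sigma_0^{1/2} (real case)."""
    M = A.shape[0]
    Sig0 = A[:, S0] @ A[:, S0].T + sigma2 * np.eye(M)
    Sig1 = A[:, S1] @ A[:, S1].T + sigma2 * np.eye(M)
    # With Sigma_1 = C C', the symmetric matrix C^{-1} Sigma_0 C^{-T} is similar to H.
    C = np.linalg.cholesky(Sig1)
    Z = np.linalg.solve(C, np.linalg.solve(C, Sig0).T)
    return np.linalg.eigvalsh((Z + Z.T) / 2)

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 12))          # M = 8, N = 12 (hypothetical sizes)
S0, S1 = [0, 1, 2, 3], [2, 3, 4]          # k_i = 2, k_0 = 2, k_1 = 1
h = eigs_of_H(A, S0, S1, sigma2=1.0)

tol = 1e-8                                # numerical tolerance for "equal to 1"
counts = (
    int(np.sum(h > 1 + tol)),             # expected k_0
    int(np.sum(np.abs(h - 1) <= tol)),    # expected M - (k_0 + k_1)
    int(np.sum(h < 1 - tol)),             # expected k_1
)
```

For this configuration Proposition 1 predicts the triple (2, 5, 1).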

For binary support recovery with $|S_0| = |S_1| = K$,
we have $k_0 = k_1 \triangleq k_d$. The subscripts i and d in $k_i$ and
$k_d$ are short for "intersection" and "difference", respectively.
Employing the Chernoff bound (14) and Proposition 1, we
have

Proposition 2 If $M \ge 2k_d$, the probability of error for the
binary support recovery problem (4) is bounded by

$P_{err} \le \frac{1}{2} \left[ \frac{\bar\lambda_{S_0,S_1} \bar\lambda_{S_1,S_0}}{16} \right]^{-\kappa k_d T / 2},$ (25)

where $\bar\lambda_{S_i,S_j}$ is the geometric mean of the eigenvalues of
$H = \Sigma_i^{1/2} \Sigma_j^{-1} \Sigma_i^{1/2}$ that are greater than one.

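Proof: According to (14) and (24), we have

$P_{err} \le \frac{1}{2} \exp[\mu(1/2)]$
$\quad \le \frac{1}{2} \left[ \prod_{j=1}^{k_d} \frac{\lambda_j}{4} \prod_{j=1}^{k_d} \frac{1}{4 \sigma_j} \right]^{-\kappa T / 2}$
$\quad = \frac{1}{2} \left[ \frac{\left( \prod_{j=1}^{k_d} \lambda_j \right)^{1/k_d} \left( \prod_{j=1}^{k_d} 1/\sigma_j \right)^{1/k_d}}{16} \right]^{-\kappa k_d T / 2}.$

Define $\bar\lambda_{S_i,S_j}$ as the geometric mean of the eigenvalues of
$H = \Sigma_i^{1/2} \Sigma_j^{-1} \Sigma_i^{1/2}$ that are greater than one. Then obviously
we have $\bar\lambda_{S_0,S_1} = \left( \prod_{j=1}^{k_d} \lambda_j \right)^{1/k_d}$. Since $H^{-1}$ and
$\Sigma_1^{1/2} \Sigma_0^{-1} \Sigma_1^{1/2}$ have the same set of eigenvalues, $1/\sigma_j$,
$j = 1, \ldots, k_d$, are the eigenvalues of $\Sigma_1^{1/2} \Sigma_0^{-1} \Sigma_1^{1/2}$ that are greater
than 1. We conclude that $\bar\lambda_{S_1,S_0} = \left( \prod_{j=1}^{k_d} 1/\sigma_j \right)^{1/k_d}$. ■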

Note that $\bar\lambda_{S_0,S_1}$ and $\bar\lambda_{S_1,S_0}$ completely determine the
measurement system (3)'s performance in differentiating two
different signal supports. Their product must be larger than the
constant 16 for a vanishing bound when more temporal samples are
taken. Once the threshold 16 is exceeded, taking more samples
will drive the probability of error to 0 exponentially fast.
From numerical simulations and our results on the Gaussian
measurement matrix, $\bar\lambda_{S_i,S_j}$ does not vary much when $S_i$, $S_j$
and $k_d$ change, as long as the elements in the measurement
matrix $A$ are highly uncorrelated (see footnote 1). Therefore, quite appealing
to intuition, the larger the size $k_d$ of the difference set between
the two candidate supports, the smaller the probability of error.
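
The quantities in Proposition 2 can be computed directly for a given matrix. The sketch below (real case, $\kappa = 1/2$) evaluates the two geometric means, the bound (25), and the Chernoff value $\frac{1}{2} \exp[\mu(1/2)]$ that (25) relaxes. The helper names, the random matrix, the supports, and the values of $\sigma^2$ and $T$ are assumptions chosen for illustration.

```python
import numpy as np

def cov(A, S, sigma2):
    return A[:, S] @ A[:, S].T + sigma2 * np.eye(A.shape[0])

def eigs(A, Si, Sj, sigma2):
    """Eigenvalues of Sigma_i^{1/2} Sigma_j^{-1} Sigma_i^{1/2} via a similar symmetric matrix."""
    C = np.linalg.cholesky(cov(A, Sj, sigma2))                    # Sigma_j = C C'
    Z = np.linalg.solve(C, np.linalg.solve(C, cov(A, Si, sigma2)).T)
    return np.linalg.eigvalsh((Z + Z.T) / 2)

def geo_mean_above_one(h, tol=1e-8):
    """Geometric mean of the eigenvalues that exceed 1."""
    big = h[h > 1 + tol]
    return float(np.exp(np.mean(np.log(big))))

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 12))                                  # hypothetical sizes
S0, S1, sigma2, T, kappa = [0, 1, 2], [2, 3, 4], 0.5, 10, 0.5
k_d = len(set(S0) - set(S1))                                      # size of the difference set

lam01 = geo_mean_above_one(eigs(A, S0, S1, sigma2))
lam10 = geo_mean_above_one(eigs(A, S1, S0, sigma2))
bound_25 = 0.5 * (lam01 * lam10 / 16) ** (-kappa * k_d * T / 2)   # right-hand side of (25)

h = eigs(A, S0, S1, sigma2)
mu_half = -kappa * T * np.sum(np.log((np.sqrt(h) + 1 / np.sqrt(h)) / 2))
chernoff_half = 0.5 * np.exp(mu_half)                             # (14) evaluated at s = 1/2
```

Because (25) drops the smaller term inside each logarithm of $\mu(1/2)$, `bound_25` is at least as large as `chernoff_half`.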

B. Multiple Support Recovery 

Now we are ready to use the union bound (17) to study
the probability of error for the multiple support recovery
problem (5). We assume each candidate support $S_i$ has known
cardinality $K$, and we have $L = \binom{N}{K}$ such supports. Our
general approach is also applicable to cases for which we have
some prior information on the structure of the signal's sparsity
pattern, for example the setup in model-based Compressive
Sensing [47]. In these cases, we usually have $L \ll \binom{N}{K}$
supports, and a careful examination on the intersection pattern
of these supports will give a better bound. However, in this
paper we will not address this problem and will instead focus
on the full support recovery problem with $L = \binom{N}{K}$. Defining
$\bar\lambda = \min_{i \ne j} \{ \bar\lambda_{S_i,S_j} \}$, we have the following theorem:



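Theorem 1 If $M \ge 2K$ and $\bar\lambda > 4[K(N-K)]^{\frac{1}{\kappa T}}$, then the
probability of error for the full support recovery problem (5)
with $|S_i| = K$ and $L = \binom{N}{K}$ is bounded by

$P_{err} \le \frac{1}{2} \cdot \frac{K(N-K)}{\left( \bar\lambda / 4 \right)^{\kappa T} - K(N-K)}.$ (26)
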



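Proof: Combining the bound in Proposition 2 and Equation
(17), we have

$P_{err} \le \frac{1}{2L} \sum_{i=0}^{L-1} \sum_{j=0, j \ne i}^{L-1} \left[ \frac{\bar\lambda_{S_i,S_j} \bar\lambda_{S_j,S_i}}{16} \right]^{-\kappa k_d T / 2}$
$\quad \le \frac{1}{2L} \sum_{i=0}^{L-1} \sum_{j=0, j \ne i}^{L-1} \left( \frac{\bar\lambda}{4} \right)^{-\kappa k_d T}.$

Here $k_d$ depends on the supports $S_i$ and $S_j$. For fixed $S_i$, the
number of supports that have a difference set with $S_i$ with
cardinality $k_d$ is $\binom{K}{k_d} \binom{N-K}{k_d}$. Therefore, using $\binom{K}{k_d} \le K^{k_d}$
and $\binom{N-K}{k_d} \le (N-K)^{k_d}$ and the summation formula for
geometric series, we obtain

$P_{err} \le \frac{1}{2L} \sum_{i=0}^{L-1} \sum_{k_d=1}^{K} \binom{K}{k_d} \binom{N-K}{k_d} \left( \frac{\bar\lambda}{4} \right)^{-\kappa k_d T}$
$\quad \le \frac{1}{2} \sum_{k_d=1}^{K} \left[ \frac{K(N-K)}{\left( \bar\lambda / 4 \right)^{\kappa T}} \right]^{k_d}$
$\quad \le \frac{1}{2} \cdot \frac{K(N-K)}{\left( \bar\lambda / 4 \right)^{\kappa T} - K(N-K)}.$

Footnote 1: Unfortunately, this is not the case when the columns of $A$ are samples
from uniform linear sensor array manifold.
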

We make several comments here. First, $\bar\lambda$ depends solely
on the measurement matrix $A$. Compared with the results
in [43], where the bounds involve the signal, we get more
insight into what quantity of the measurement matrix is
important in support recovery. This information is obtained
by modelling the signals $x(t)$ as Gaussian random vectors.
The quantity $\bar\lambda$ effectively characterizes system (3)'s ability
to distinguish different supports. Clearly, $\bar\lambda$ is related to the
restricted isometry property (RIP), which guarantees stable
sparse signal recovery in Compressive Sensing [4]-[6]. We
discuss the relationship between RIP and $\bar\lambda$ for the special case
with $K = 1$ at the end of Section III-C. However, a precise
relationship for the general case is yet to be discovered.

Second, we observe that increasing the number of temporal samples plays two roles simultaneously in the measurement system. For one thing, it decreases the threshold $4[K(N-K)]^{1/(\kappa T)}$ that $\bar{\lambda}$ must exceed for the bound (26) to hold. However, since $\lim_{T\to\infty} 4[K(N-K)]^{1/(\kappa T)} = 4$ for fixed $K$ and $N$, increasing temporal samples can reduce the threshold only to a certain limit. For another, since the bound (26) is proportional to $e^{-\kappa T\log(\bar{\lambda}/4)}$, the probability of error tends to zero exponentially fast as $T$ increases, as long as $\bar{\lambda} > 4[K(N-K)]^{1/(\kappa T)}$ is satisfied.
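The two roles of $T$ can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the bound is taken in the geometric-series form $\frac{1}{2}\,r/(1-r)$ with $r = K(N-K)(\bar{\lambda}/4)^{-\kappa T}$ that appears at the end of the proof, and $\kappa$ is assumed to be $1/2$ for real and $1$ for complex signals.

```python
import math

def threshold(N, K, T, kappa=0.5):
    """Value 4*[K(N-K)]**(1/(kappa*T)) that lambda_bar must exceed."""
    return 4.0 * (K * (N - K)) ** (1.0 / (kappa * T))

def union_bound(N, K, T, lam_bar, kappa=0.5):
    """Geometric-series form of the upper bound on P_err.

    Returns None when lambda_bar does not exceed the threshold, because the
    geometric-series step of the proof does not apply in that case.
    """
    r = K * (N - K) * (lam_bar / 4.0) ** (-kappa * T)
    if r >= 1.0:
        return None
    return 0.5 * r / (1.0 - r)

if __name__ == "__main__":
    N, K, lam_bar = 100, 3, 12.0  # illustrative values, not from the paper
    for T in (10, 20, 40, 80):
        print(T, threshold(N, K, T), union_bound(N, K, T, lam_bar))
```

With these illustrative values the threshold decreases toward 4 as $T$ grows, and once it drops below $\bar{\lambda}$ the bound shrinks geometrically in $T$.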

In addition, the final bound (26) is of the same order as the probability of error when $k_d = 1$. The probability of error $P_{\mathrm{err}}$ is dominated by the probability of error in cases for which the estimated support differs by only one index from the true support, which are the most difficult cases for the decision rule to make a choice. However, in practice we can imagine that these cases induce the least loss. Therefore, if we assign weights/costs to the errors based on $k_d$, then the weighted probability of error or average cost would be much lower. For example, we can choose the costs to exponentially decrease when $k_d$ increases. Another possible choice of cost function is to assume zero cost when $k_d$ is below a certain critical number. Our results can be easily extended to these scenarios.

Finally, note that our bound (26) applies to any non-degenerate matrix. In Section V, we apply the bound to Gaussian measurement matrices. The additional structure allows us to derive more profound results on the behavior of the bound.

C. The Effect of Noise 

In this subsection, we explore how the noise variance affects the probability of error, which is equivalent to analyzing the behavior of $\bar{\lambda}_{S_i,S_j}$ and $\bar{\lambda}$ as indicated in (25) and (26).

We now derive bounds on the eigenvalues of $H$. The lower bound is expressed in terms of the QR decomposition of a submatrix of the measurement matrix with the noise variance $\sigma^2$ isolated.



Proposition 3: For any non-degenerate measurement matrix $A$, let $H = \Sigma_0^{1/2}\Sigma_1^{-1}\Sigma_0^{1/2}$ with $\Sigma_i = A_{S_i}A_{S_i}^{\dagger} + \sigma^2 I_M$, $k_i = |S_0 \cap S_1|$, $k_0 = |S_0 \setminus S_1| = |S_0| - k_i$, $k_1 = |S_1 \setminus S_0| = |S_1| - k_i$. We have the following:

1) if $M \ge k_0 + k_1$, then the sorted eigenvalues of $H$ that are greater than 1 are lower bounded by the corresponding eigenvalues of $I_{k_0} + \frac{1}{\sigma^2}R_{33}R_{33}^{\dagger}$, where $R_{33}$ is the $k_0 \times k_0$ submatrix at the lower-right corner of the upper triangular matrix in the QR decomposition of $[A_{S_1\setminus S_0}\ A_{S_1\cap S_0}\ A_{S_0\setminus S_1}]$;

2) the eigenvalues of $H$ are upper bounded by the corresponding eigenvalues of $I_M + \frac{1}{\sigma^2}A_{S_0\setminus S_1}A_{S_0\setminus S_1}^{\dagger}$; in particular, the sorted eigenvalues of $H$ that are greater than 1 are upper bounded by the corresponding ones of $I_{k_0} + \frac{1}{\sigma^2}A_{S_0\setminus S_1}^{\dagger}A_{S_0\setminus S_1}$.



Proof: See Appendix B. 



The importance of this proposition is twofold. First, by isolating the noise variance from the expression of matrix $H$, this proposition clearly shows that when the noise variance decreases to zero, the relatively large eigenvalues of $H$ will blow up, which results in increased performance in support recovery. Second, the bounds provide ways to analyze special measurement matrices, especially the Gaussian measurement ensemble discussed in Section V.

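We have the following corollary:

Corollary 1: For the binary support recovery problem and the multiple support recovery problem (5) with support size $K$, suppose $M \ge 2K$; then there exist constants $c_1, c_2 > 0$ that depend only on the measurement matrix $A$ such that

$$
1 + \frac{c_2}{\sigma^2} \ge \bar{\lambda} \ge 1 + \frac{c_1}{\sigma^2}. \tag{27}
$$

From (25) and (26), we then conclude that for any temporal sample size $T$

$$
\lim_{\sigma^2 \to 0} P_{\mathrm{err}} = 0, \tag{28}
$$

and the speed of convergence is approximately $(\sigma^2)^{\kappa k_d T}$ and $(\sigma^2)^{\kappa T}$ for the binary and multiple cases, respectively.

Proof: According to Proposition 3, for any fixed $S_i$, $S_j$, the eigenvalues of $H = \Sigma_i^{1/2}\Sigma_j^{-1}\Sigma_i^{1/2}$ that are greater than 1 are lower bounded by those of $I_{k_d} + \frac{1}{\sigma^2}R_{33}R_{33}^{\dagger}$; hence we have

$$
\bar{\lambda}_{S_i,S_j} \ge \left|I_{k_d} + \frac{1}{\sigma^2}R_{33}R_{33}^{\dagger}\right|^{1/k_d}
\ge 1 + \frac{1}{\sigma^2}\left|R_{33}R_{33}^{\dagger}\right|^{1/k_d}
= 1 + \frac{1}{\sigma^2}\left(\prod_{l=1}^{k_d} r_{ll}^2\right)^{1/k_d}, \tag{29}
$$

where $r_{ll}$ is the $l$th diagonal element of $R_{33}$. For the second inequality we have used Fact 8.11.20 in [48]. Since $A$ is non-degenerate and $M \ge 2K$, $[A_{S_j\setminus S_i}\ A_{S_j\cap S_i}\ A_{S_i\setminus S_j}]$ is of full rank and $r_{ll}^2 > 0$, $1 \le l \le k_d$, for all $S_i$, $S_j$. Defining $c_1$ as the minimal value of $\left(\prod_{l=1}^{k_d} r_{ll}^2\right)^{1/k_d}$ over all possible support pairs $S_i$, $S_j$, we then have $c_1 > 0$ and

$$
\bar{\lambda} \ge 1 + \frac{c_1}{\sigma^2}.
$$

On the other hand, the upper bound on the eigenvalues of $H$ yields

$$
\bar{\lambda}_{S_i,S_j} \le \left|I_{k_d} + \frac{1}{\sigma^2}A_{S_i\setminus S_j}^{\dagger}A_{S_i\setminus S_j}\right|^{1/k_d}
\le 1 + \frac{1}{\sigma^2 k_d}\operatorname{tr}\left(A_{S_i\setminus S_j}^{\dagger}A_{S_i\setminus S_j}\right)
= 1 + \frac{1}{\sigma^2 k_d}\sum_{\substack{1 \le m \le M \\ n \in S_i\setminus S_j}} |A_{mn}|^2. \tag{30}
$$

Therefore, we have

$$
\bar{\lambda} \le 1 + \frac{c_2}{\sigma^2}
$$

with $c_2 = \max_{S:\,|S| \le K} \frac{1}{|S|}\sum_{1 \le m \le M,\ n \in S} |A_{mn}|^2$. All other statements in the corollary follow immediately from (25) and (26).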

Corollary 1 suggests that in the limiting case where there is no noise, $M \ge 2K$ is sufficient to recover a $K$-sparse signal. This fact has been observed in [4]. Our result also shows that the optimal decision rule, which is unfortunately inefficient, is robust to noise. Another extreme case is when the noise variance $\sigma^2$ is very large. Then from $\log(1+x) \approx x$, $0 < x \ll 1$, the bounds in (25) and (26) are approximated by $e^{-\kappa k_d T/\sigma^2}$ and $e^{-\kappa T/\sigma^2}$. Therefore, the convergence exponents for the bounds are proportional to the SNR in this limiting case.

The diagonal elements of $R_{33}$, the $r_{ll}$'s, have clear meanings. Since QR factorization is equivalent to the Gram-Schmidt orthogonalization procedure, $r_{11}$ is the distance of the first column of $A_{S_i\setminus S_j}$ to the subspace spanned by the columns of $A_{S_j}$, $r_{22}$ is the distance of the second column of $A_{S_i\setminus S_j}$ to the subspace spanned by the columns of $A_{S_j}$ plus the first column of $A_{S_i\setminus S_j}$, and so on. Therefore, $\bar{\lambda}_{S_i,S_j}$ is a measure of how well the columns of $A_{S_i\setminus S_j}$ can be expressed by the columns of $A_{S_j}$, or, put another way, a measure of the incoherence between the columns of $A_{S_i}$ and $A_{S_j}$. Similarly, $\bar{\lambda}$ is an indicator of the incoherence of the entire matrix $A$ of order $K$.
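The distance interpretation of the diagonal of $R_{33}$ can be checked with a few lines of Gram-Schmidt. This is a minimal sketch for the real case only; the helper names and the small 3-dimensional example are hypothetical and are not taken from the paper, and the columns are assumed to be linearly independent.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt_diag(columns):
    """Return |r_11|, |r_22|, ... of the QR factorization of the matrix whose
    columns are given, computed by modified Gram-Schmidt (real case)."""
    basis, diag = [], []
    for col in columns:
        residual = list(col)
        for q in basis:
            c = dot(q, residual)
            residual = [x - c * y for x, y in zip(residual, q)]
        norm = math.sqrt(dot(residual, residual))
        diag.append(norm)  # distance of this column to the span of the earlier ones
        basis.append([x / norm for x in residual])  # assumes independent columns
    return diag

# Hypothetical example: A_{S_j} has one column, A_{S_i \ S_j} has two.
A_Sj = [[1.0, 0.0, 0.0]]
A_Si_minus_Sj = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
diag = gram_schmidt_diag(A_Sj + A_Si_minus_Sj)
r33_diag = diag[len(A_Sj):]  # diagonal of the lower-right block R_33
print(r33_diag)
```

Placing the columns of $A_{S_j}$ first mirrors the column ordering used in the QR decomposition above, so the trailing diagonal entries are exactly the successive distances described in the text.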

To relate $\bar{\lambda}$ with the incoherence, we consider the case with $K = 1$ and $\mathbb{F} = \mathbb{R}$. By restricting our attention to matrices with unit columns, the above discussion implies that a better bound is achieved if the minimal distance of all pairs of column vectors of matrix $A$ is maximized. Finding such a matrix $A$ is equivalent to finding a matrix with the absolute inner products between columns as small as possible, since the squared distance between two unit vectors $u$ and $v$, up to a sign change of $v$, is $2 - 2|\langle u, v\rangle|$, where $\langle u, v\rangle = u'v$ is the inner product between $u$ and $v$. For each integer $s$, the RIP constant $\delta_s$ is defined as the smallest number such that [4], [5]:

$$
1 - \delta_s \le \frac{\|Ax\|_2^2}{\|x\|_2^2} \le 1 + \delta_s, \qquad |\operatorname{supp}(x)| \le s. \tag{31}
$$

A direct computation shows that $\delta_2$ is equal to the maximum of the absolute values of the inner products between all pairs of columns of $A$. Hence, the requirements of finding the smallest $\delta_2$ that satisfies (31) and maximizing $\bar{\lambda}$ coincide when $K = 1$. For general $K$, Milenkovic et al. established a relationship between $\delta_2$ and $\delta_K$ via Geršgorin's disc theorem [49] and discussed them as well as some coding theoretic issues in the Compressive Sensing context [50].
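For unit-norm real columns, the link between $\delta_2$ and the pairwise inner products can be made explicit: a vector supported on two columns only sees the $2 \times 2$ Gram matrix $[[1, g], [g, 1]]$, whose eigenvalues are $1 \pm |g|$. The sketch below is illustrative only; the function names and the example matrix are hypothetical.

```python
import math
from itertools import combinations

def normalize(col):
    n = math.sqrt(sum(x * x for x in col))
    return [x / n for x in col]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def coherence(columns):
    """Largest absolute inner product between distinct unit-norm columns."""
    cols = [normalize(c) for c in columns]
    return max(abs(inner(u, v)) for u, v in combinations(cols, 2))

def delta_2(columns):
    """Order-2 RIP constant for unit-norm columns, from the extreme
    eigenvalues 1 - |g| and 1 + |g| of every 2x2 Gram matrix."""
    cols = [normalize(c) for c in columns]
    worst = 0.0
    for u, v in combinations(cols, 2):
        g = inner(u, v)
        lam_min, lam_max = 1.0 - abs(g), 1.0 + abs(g)
        worst = max(worst, 1.0 - lam_min, lam_max - 1.0)
    return worst

def min_distance_to_span(columns):
    """Smallest distance from one unit column to the line spanned by another."""
    cols = [normalize(c) for c in columns]
    return min(math.sqrt(max(0.0, 1.0 - inner(u, v) ** 2))
               for u, v in combinations(cols, 2))

# Hypothetical 2 x 3 matrix given column by column.
example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(coherence(example), delta_2(example), min_distance_to_span(example))
```

A smaller largest inner product gives both a smaller $\delta_2$ and a larger minimal distance, which is the coincidence of the two design goals noted for $K = 1$.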



IV. An Information Theoretic Lower Bound on 
Probability of Error 

In this section, we derive an information theoretic lower 
bound on the probability of error for any decision rule in the 
multiple support recovery problem. The main tool is a variant 
of the well-known Fano's inequality [51]. In the variant, the 
average probability of error in a multiple-hypothesis testing 
problem is bounded in terms of the Kullback-Leibler divergence [52]. Suppose that we have a random vector or matrix $Y$ with $L$ possible densities $f_0, \ldots, f_{L-1}$. Denote the average of the Kullback-Leibler divergence between any pair of densities by

$$
\beta = \frac{1}{L^2}\sum_{i,j} D_{KL}(f_i \,\|\, f_j). \tag{32}
$$

Then by Fano's inequality [53], [43], the probability of error (11) for any decision rule to identify the true density is lower bounded by

$$
P_{\mathrm{err}} \ge 1 - \frac{\beta + \log 2}{\log L}. \tag{33}
$$



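Since in the multiple support recovery problem (5), all the distributions involved are matrix variate Gaussian distributions with zero mean and different variances, we now compute the Kullback-Leibler divergence between two matrix variate Gaussian distributions. Suppose $f_i = \mathbb{F}\mathcal{N}_{M,T}(0, \Sigma_i \otimes I_T)$ and $f_j = \mathbb{F}\mathcal{N}_{M,T}(0, \Sigma_j \otimes I_T)$; the Kullback-Leibler divergence has the closed form expression:

$$
D_{KL}(f_i \,\|\, f_j) = E_{f_i}\log\frac{f_i}{f_j}
= \frac{1}{2}\kappa T\left[\operatorname{tr}\left(H_{i,j} - I_M\right) + \log\frac{|\Sigma_j|}{|\Sigma_i|}\right],
$$

where $H_{i,j} = \Sigma_i^{1/2}\Sigma_j^{-1}\Sigma_i^{1/2}$. Therefore, we obtain the average Kullback-Leibler divergence (32) for the multiple support recovery problem as

$$
\beta = \frac{1}{L^2}\sum_{S_i,S_j}\frac{1}{2}\kappa T\left[\operatorname{tr}(H_{i,j}) - M + \log\frac{|\Sigma_j|}{|\Sigma_i|}\right]
= \frac{\kappa T}{2L^2}\sum_{S_i,S_j}\left[\operatorname{tr}(H_{i,j}) - M\right],
$$

where the $\log\frac{|\Sigma_j|}{|\Sigma_i|}$ terms all cancel out and $L = \binom{N}{K}$. Invoking the second part of Proposition 3, we get

$$
\operatorname{tr}(H_{i,j}) \le \operatorname{tr}\left[I_M + \frac{1}{\sigma^2}A_{S_i\setminus S_j}A_{S_i\setminus S_j}^{\dagger}\right]
= M + \frac{1}{\sigma^2}\sum_{\substack{1 \le m \le M \\ n \in S_i\setminus S_j}} |A_{mn}|^2.
$$

Therefore, the average Kullback-Leibler divergence is bounded by

$$
\beta \le \frac{\kappa T}{2\sigma^2 L^2}\sum_{S_i,S_j}\sum_{\substack{1 \le m \le M \\ n \in S_i\setminus S_j}} |A_{mn}|^2.
$$

Due to the symmetry of the right-hand side, it must be of the form $a\sum_{1 \le m \le M,\ 1 \le n \le N}|A_{mn}|^2 = a\|A\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm. Setting all $A_{mn} = 1$ gives

$$
\frac{\kappa T}{2\sigma^2 L^2}\sum_{S_i,S_j}\sum_{\substack{1 \le m \le M \\ n \in S_i\setminus S_j}} 1 = aMN.
$$

Therefore, we get $a = \frac{\kappa T K(N-K)}{2\sigma^2 N^2}$ using the mean expression for the hypergeometric distribution:

$$
\sum_{k_d=1}^{K} k_d\,\frac{\binom{K}{k_d}\binom{N-K}{k_d}}{\binom{N}{K}} = \frac{K(N-K)}{N}.
$$

Hence, we have

$$
\beta \le \frac{\kappa T K(N-K)}{2\sigma^2 N^2}\|A\|_F^2.
$$

Therefore, the probability of error is lower bounded by

$$
P_{\mathrm{err}} \ge 1 - \frac{\dfrac{\kappa T K(N-K)}{2\sigma^2 N^2}\|A\|_F^2 + \log 2}{\log L}. \tag{34}
$$

We conclude with the following theorem:

Theorem 2: For multiple support recovery problem (5), the probability of error for any decision rule is lower bounded by

$$
P_{\mathrm{err}} \ge 1 - \frac{\kappa T\,\dfrac{K}{2\sigma^2 N}\left(1 - \dfrac{K}{N}\right)\|A\|_F^2}{\log L} + o(1). \tag{35}
$$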
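The counting step behind the constant $a$ and the resulting bound can be checked numerically. The sketch below is only illustrative: the function names are hypothetical, the first function verifies the hypergeometric mean $K(N-K)/N$ exactly with rational arithmetic, and the second evaluates the lower bound in the form $1 - (\beta_{\max} + \log 2)/\log L$ with $\beta_{\max} = \frac{\kappa T K(N-K)}{2\sigma^2 N^2}\|A\|_F^2$, which is assumed here to be the form intended by the text.

```python
import math
from fractions import Fraction

def mean_difference_size(N, K):
    """Exact average of k_d = |S_i \\ S_j| over all supports S_j of size K."""
    L = math.comb(N, K)
    total = sum(k * math.comb(K, k) * math.comb(N - K, k) for k in range(1, K + 1))
    return Fraction(total, L)

def fano_lower_bound(N, K, T, sigma2, frob2, kappa=0.5):
    """Lower bound on P_err from the averaged KL divergence (may be negative).

    frob2 is the squared Frobenius norm of the measurement matrix.
    """
    beta_max = kappa * T * K * (N - K) * frob2 / (2.0 * sigma2 * N ** 2)
    return 1.0 - (beta_max + math.log(2.0)) / math.log(math.comb(N, K))

if __name__ == "__main__":
    print(mean_difference_size(10, 3))              # equals K(N-K)/N as a fraction
    print(fano_lower_bound(100, 5, 1, 1.0, 100.0))  # illustrative values only
```

A negative return value simply means the bound is vacuous for those parameters, which matches the later remark that the bound becomes loose when $T$, the gain, or the SNR is large.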



Each term in bound (35) has a clear meaning. The squared Frobenius norm of the measurement matrix, $\|A\|_F^2$, is the total gain of the measurement system. Since the measured signal is $K$-sparse, only a fraction of the gain plays a role in the measurement, and its average over all possible $K$-sparse signals is $\frac{K}{N}\|A\|_F^2$. While an increase in signal energy enlarges the distances between signals, a penalty term $\left(1 - \frac{K}{N}\right)$ is introduced because we now have more signals. The term $\log L = \log\binom{N}{K}$ is the total uncertainty or entropy of the support variable $S$, since we impose a uniform prior on it. As long as $K \le \frac{N}{2}$, increasing $K$ increases both the average gain exploited by the measurement system, and the entropy of the support variable $S$. The overall effect, quite counterintuitively, is a decrease of the lower bound in (35). Actually, the term involving $K$, $\frac{\frac{K}{N}\left(1 - \frac{K}{N}\right)}{\log\binom{N}{K}}$, is approximated by an increasing function $\frac{\alpha(1-\alpha)}{N H(\alpha)}$ with $\alpha = \frac{K}{N}$ and the binary entropy function $H(\alpha) = -\alpha\log\alpha - (1-\alpha)\log(1-\alpha)$. The reason for the decrease of the bound is that the bound only involves the effective SNR without regard to any inner structure of $A$ (e.g. the incoherence) and the effective SNR increases with $K$. To see this, we compute the effective SNR as

$$
\frac{E\|Ax(t)\|_2^2}{M\sigma^2} = \frac{\frac{K}{N}\|A\|_F^2}{M\sigma^2}.
$$

If we scale down the effective SNR through increasing the noise energy $\sigma^2$ by a factor of $K$, then the bound is strictly increasing with $K$.

The above analysis suggests that the lower bound (35) is weak as it disregards any incoherence property of the measurement matrix $A$. For some cases, the resulting bound on the number of measurements reduces to $\frac{2\sigma^2}{\kappa}\log\frac{N}{K}$ (refer to Corollary 2, Theorem 3 and 4) and is less than $K$ when the noise level or $K$ is relatively large. Certainly recovering the support is not possible with fewer than $K$ measurements. The bound is loose also in the sense that when $T$, $\|A\|_F^2$, or the SNR $1/\sigma^2$ is large enough the bound becomes negative, but when there is noise, perfect support recovery is generally impossible. While the original Fano's inequality

$$
H(P_{\mathrm{err}}) + P_{\mathrm{err}}\log(L-1) \ge H(S|Y) \tag{36}
$$

is tight in some cases [51], the adoption of the average divergence (32) as an upper bound on the mutual information $I(S;Y)$ between the random support $S$ and the observation $Y$ reduces the tightness (see the proof of (33) in [54]). Due to the difficulty of computing $H(S|Y)$ and $I(S;Y)$ analytically, it is not clear whether a direct application of (36) results in a significantly better bound.

Despite the drawbacks discussed above, the bound (35) identifies the importance of the gain $\|A\|_F^2$ of the measurement matrix, a quantity usually ignored in, for example, Compressive Sensing. We can also draw some interesting conclusions from (35) for measurement matrices with special properties. In particular, in the following corollary, we consider measurement matrices with rows or columns normalized to one. The rows of a measurement matrix are normalized to one in the sensor network scenario (SNET), where each sensor is power limited, while the columns are sometimes normalized to one in Compressive Sensing (refer to [42] and references therein).

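Corollary 2: In order to have a probability of error $P_{\mathrm{err}} < \varepsilon$ with $0 < \varepsilon < 1$, the number of measurements must satisfy:

$$
MT \ge (1-\varepsilon)\frac{2\sigma^2 N\log\binom{N}{K}}{\kappa K\left(1 - \frac{K}{N}\right)} + o(1)
\ge (1-\varepsilon)\frac{2\sigma^2 N}{\kappa}\log\frac{N}{K} + o(1), \tag{37}
$$

if the rows of $A$ have unit norm; and

$$
T \ge (1-\varepsilon)\frac{2\sigma^2\log\binom{N}{K}}{\kappa K\left(1 - \frac{K}{N}\right)} + o(1)
\ge (1-\varepsilon)\frac{2\sigma^2}{\kappa}\log\frac{N}{K} + o(1), \tag{38}
$$

if the columns of $A$ have unit norm.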



Note that the necessary condition (37) has the same critical quantity as the sufficient condition in Compressive Sensing. The inequality in (38) is independent of $M$. Therefore, if the columns are normalized to have unit norm, it is necessary to have multiple temporal measurements for a vanishing probability of error. Refer to Theorem 3 and 4 and the discussions following them.

In the work of [7], each column of $A$ is the array manifold vector function evaluated at a sample of the DOA parameter. The implication of the bound (35) for optimal design is that we should construct an array whose geometry leads to maximal $\|A\|_F^2$. However, under the narrowband signal assumption and narrowband array assumption [55], the array manifold vector for isotropic sensor arrays always has norm $\sqrt{M}$ [56], which means that $\|A\|_F^2 = MN$. Hence in this case, the probability of error is always bounded by

$$
P_{\mathrm{err}} \ge 1 - \frac{\kappa T\,\dfrac{K}{2\sigma^2 N}\left(1 - \dfrac{K}{N}\right)MN}{\log\binom{N}{K}} + o(1). \tag{39}
$$

Therefore, we have the following theorem.



Theorem 3: Under the narrowband signal assumption and narrowband array assumption, for an isotropic sensor array in the DOA estimation scheme proposed in [7], in order to let the probability of error $P_{\mathrm{err}} < \varepsilon$ with $0 < \varepsilon < 1$ for any decision rule, the number of measurements must satisfy the following:

$$
MT \ge (1-\varepsilon)\frac{2\sigma^2\log\binom{N}{K}}{\kappa K\left(1 - \frac{K}{N}\right)} + o(1)
\ge (1-\varepsilon)\frac{2\sigma^2}{\kappa}\log\frac{N}{K} + o(1). \tag{40}
$$
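To get a feel for how slowly the required number of samples grows with the angular resolution $N$, the leading terms of the necessary condition can be evaluated directly. This is a rough sketch: the function names are hypothetical, the $o(1)$ terms are dropped, and $\kappa = 1$ (complex signals) is assumed as the default because DOA data are typically complex.

```python
import math

def min_total_samples(N, K, sigma2, eps, kappa=1.0):
    """Leading term of the necessary condition on M*T (o(1) term dropped)."""
    return ((1.0 - eps) * 2.0 * sigma2 * math.log(math.comb(N, K))
            / (kappa * K * (1.0 - K / N)))

def min_total_samples_simplified(N, K, sigma2, eps, kappa=1.0):
    """Weaker closed form: (1 - eps) * (2 * sigma2 / kappa) * log(N / K)."""
    return (1.0 - eps) * 2.0 * sigma2 / kappa * math.log(N / K)

if __name__ == "__main__":
    # Illustrative values: K = 2 sources, noise variance 1, target error 0.1.
    for N in (90, 180, 360, 720):
        print(N, min_total_samples(N, 2, 1.0, 0.1),
              min_total_samples_simplified(N, 2, 1.0, 0.1))
```

Doubling the grid size $N$ adds only a roughly constant amount to the requirement, which is the logarithmic dependence on resolution discussed after the theorem.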



We comment that the same lower bound applies to the Fourier measurement matrix (not normalized by $1/\sqrt{M}$) due to the same line of argument. We will not explicitly present this result in the current paper.

Since in radar and sonar applications the number of targets $K$ is usually small, our result shows that the number of samples is lower bounded by $\log N$. Note that $N$ is the number of intervals we use to divide the whole range of DOA; hence, it is a measure of resolution. Therefore, the number of samples only needs to increase in the logarithm of $N$, which is very desirable. The symmetric roles played by $M$ and $T$ are also desirable since $M$ is the number of sensors and is expensive to increase. As a consequence, we simply increase the number of samples to achieve a desired probability of error. In addition, unlike the upper bound of Theorem 1, we do not need to assume that $M \ge 2K$ in Theorem 2 and 3. Actually, Malioutov et al. made the empirical observation that the $\ell_1$-SVD technique can resolve $M - 1$ sources if they are well separated [7]. Theorem 3 still applies to this extreme case.

Analysis of the support recovery problem with a measurement matrix obtained from sampling a manifold has considerable complexity compared with the Gaussian case. For example, it presents a significant challenge to estimate $\bar{\lambda}_{S_i,S_j}$ in the DOA problem except for a few special cases that we discuss in [57]. As we mentioned before, unlike the Gaussian case, $\bar{\lambda}_{S_i,S_j}$ for uniform linear arrays varies greatly with $S_i$ and $S_j$. Therefore, even if we can compute $\bar{\lambda}_{S_i,S_j}$, replacing it with $\bar{\lambda}$ in the upper bound of Theorem 1 would lead to a very loose bound. On the other hand, the lower bound of Theorem 2 only involves the Frobenius norm of the measurement matrix, so we apply it to the DOA problem effortlessly. However, the lower bound is weak as it does not exploit any inner structure of the measurement matrix.



Donoho et al. considered the recovery of a "sparse" wide-band signal from narrow-band measurements [58], [59], a problem with essentially the same mathematical structure when we sample the array manifold uniformly in the wave number domain instead of the DOA domain. It was found that the spectral norm of the product of the band-limiting and time-limiting operators is crucial to stable signal recovery measured by the $\ell_2$ norm. In [58], Donoho and Stark bounded the spectral norm using the Frobenius norm, which leads to the well-known uncertainty principle. The authors commented that the uncertainty principle condition demands an extreme degree of sparsity for the signal. However, this condition can be relaxed if the signal supports are widely scattered. In [7], Malioutov et al. also observed from numerical simulations that the $\ell_1$-SVD algorithm performs much better when the sources are well separated than when they are located close together. In particular, they observed that the presence of bias is mitigated greatly when sources are far apart. Donoho and Logan [59] explored the effect of the scattering of the signal support by using the "analytic principle of the large sieve". They bounded the spectral norm for the limiting operator by the maximum Nyquist density, a quantity that measures the degree of scattering of the signal support. We expect that our results can be improved in a similar manner. The challenges include using support recovery as a performance measure, incorporating multiple measurements, as well as developing the whole theory within a probabilistic framework.



V. Support Recovery for the Gaussian 
Measurement Ensemble 

In this section, we refine our results in previous sections 
from general non-degenerate measurement matrices to the 
Gaussian ensemble. Unless otherwise specified, we always
assume that the elements in a measurement matrix A are i.i.d.
samples from unit variance real or complex Gaussian distributions.
The Gaussian measurement ensemble is widely used
and studied in Compressive Sensing [2]-[6]. The additional 
structure and the theoretical tools available enable us to derive 
deeper results in this case. In this section, we assume general 
scaling of (N, M, K, T). We do not find in our results a clear 
distinction between the regime of sublinear sparsity and the 
regime of linear sparsity as the one discussed in [43]. 

We first show two corollaries on the eigenvalue structure 
for the Gaussian measurement ensemble. Then we derive 
sufficient and necessary conditions in terms of M, N, K and 
T for the system to have a vanishing probability of error. 

A. Eigenvalue Structure for a Gaussian Measurement Matrix 

First, we observe that a Gaussian measurement matrix is non-degenerate with probability one, since any $p \le M$ random vectors $a_1, a_2, \ldots, a_p$ from $\mathbb{F}\mathcal{N}(0, \Sigma)$ with $\Sigma \in \mathbb{R}^{M \times M}$ positive definite are linearly independent with probability one (refer to Theorem 3.2.1 in [45]). As a consequence, we have

Corollary 3: For Gaussian measurement matrix $A$, let $H = \Sigma_0^{1/2}\Sigma_1^{-1}\Sigma_0^{1/2}$, $k_i = |S_0 \cap S_1|$, $k_0 = |S_0 \setminus S_1| = |S_0| - k_i$, $k_1 = |S_1 \setminus S_0| = |S_1| - k_i$. If $M \ge k_0 + k_1$, then with probability one, $k_0$ eigenvalues of matrix $H$ are greater than 1, $k_1$ less than 1, and $M - (k_0 + k_1)$ equal to 1.

We refine Proposition 3 based on the well-known QR
factorization for Gaussian matrices [45], [60].

Corollary 4: With the same notations as in Corollary 3, then with probability one, we have:

1) if $M \ge k_0 + k_1$, then the sorted eigenvalues of $H$ that are greater than 1 are lower bounded by the corresponding ones of $I_{k_0} + \frac{1}{\sigma^2}R_{33}R_{33}^{\dagger}$, where the elements of $R_{33} = (r_{mn})_{k_0 \times k_0}$ satisfy:

$$
2\kappa r_{mm}^2 \sim \chi^2_{2\kappa(M - k_1 - k_i - m + 1)}, \quad 1 \le m \le k_0,
$$

$$
r_{mn} \sim \mathbb{F}\mathcal{N}(0, 1), \quad 1 \le m < n \le k_0;
$$

2) the eigenvalues of $H$ are upper bounded by the corresponding eigenvalues of $I_M + \frac{1}{\sigma^2}A_{S_0\setminus S_1}A_{S_0\setminus S_1}^{\dagger}$; in particular, the sorted eigenvalues of $H$ that are greater than 1 are upper bounded by the corresponding ones of $I_{k_0} + \frac{1}{\sigma^2}A_{S_0\setminus S_1}^{\dagger}A_{S_0\setminus S_1}$.

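Now with the distributions on the elements of the bounding matrices, we can give a sharp estimate on $\bar{\lambda}_{S_i,S_j}$. In particular, we have the following proposition:

Proposition 4: For Gaussian measurement matrix $A$, suppose $S_i$ and $S_j$ are a pair of distinct supports with the same size $K$. Then we have

$$
1 + \frac{M}{\sigma^2} \ge E\bar{\lambda}_{S_i,S_j} \ge 1 + \frac{M - K - k_d}{\sigma^2}.
$$

Proof: We copy the inequalities (29), (30) on $\bar{\lambda}_{S_i,S_j}$ here:

$$
1 + \frac{1}{\sigma^2 k_d}\sum_{\substack{1 \le m \le M \\ n \in S_i\setminus S_j}} |A_{mn}|^2 \ge \bar{\lambda}_{S_i,S_j} \ge 1 + \frac{1}{\sigma^2}\left(\prod_{m=1}^{k_d} r_{mm}^2\right)^{1/k_d}.
$$

The proof then reduces to the computation of two expectations, one of which is trivial:

$$
E\,\frac{1}{\sigma^2 k_d}\sum_{\substack{1 \le m \le M \\ n \in S_i\setminus S_j}} |A_{mn}|^2 = \frac{M}{\sigma^2}.
$$

Next, the independence of the $r_{nn}$'s and the convexity of exponential functions together with Jensen's inequality yield

$$
E\,\frac{1}{\sigma^2}\left(\prod_{n=1}^{k_d} r_{nn}^2\right)^{1/k_d}
= \frac{1}{2\kappa\sigma^2}\,E\exp\left[\frac{1}{k_d}\sum_{n=1}^{k_d}\log\left(2\kappa r_{nn}^2\right)\right]
\ge \frac{1}{2\kappa\sigma^2}\exp\left[\frac{1}{k_d}\sum_{n=1}^{k_d}E\log\left(2\kappa r_{nn}^2\right)\right].
$$

Since $2\kappa r_{nn}^2 \sim \chi^2_{2\kappa(M-K-n+1)}$, the expectation of the logarithm is $E\log\left(2\kappa r_{nn}^2\right) = \log 2 + \psi\left(\kappa(M-K-n+1)\right)$, where $\psi(z) = \Gamma'(z)/\Gamma(z)$ is the digamma function. Note that $\psi(z)$ is increasing and satisfies $\psi(z+1) \ge \log z$. Therefore, we have

$$
\begin{aligned}
E\,\frac{1}{\sigma^2}\left(\prod_{n=1}^{k_d} r_{nn}^2\right)^{1/k_d}
&\ge \frac{1}{2\kappa\sigma^2}\exp\left[\log 2 + \frac{1}{k_d}\sum_{n=1}^{k_d}\psi\left(\kappa(M-K-n+1)\right)\right] \\
&\ge \frac{1}{\kappa\sigma^2}\exp\left[\psi\left(\kappa(M-K-k_d+1)\right)\right] \\
&\ge \frac{1}{\kappa\sigma^2}\exp\left[\log\left(\kappa(M-K-k_d)\right)\right] \\
&= \frac{M-K-k_d}{\sigma^2}.
\end{aligned}
$$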

The expected value of the critical quantity $\bar{\lambda}_{S_i,S_j}$ lies between $1 + \frac{M-2K}{\sigma^2}$ and $1 + \frac{M}{\sigma^2}$, linearly proportional to $M$. Note that in conventional Compressive Sensing, the variance of the elements of $A$ is usually taken to be $\frac{1}{M}$, which is equivalent to scaling the noise variance $\sigma^2$ to $M\sigma^2$ in our model. The resultant $\bar{\lambda}_{S_i,S_j}$ is then centered between $1 + \frac{1 - 2K/M}{\sigma^2}$ and $1 + \frac{1}{\sigma^2}$.
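The chain of inequalities in the proof of Proposition 4 can be checked numerically. The sketch below is illustrative and rests on stated assumptions: the function names are hypothetical, the digamma function is computed with a standard recurrence plus asymptotic series rather than a library call, and $\kappa$ is assumed to be $1/2$ (real) or $1$ (complex).

```python
import math

def digamma(x):
    """Digamma function for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:            # shift upward: psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ log x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def expected_lambda_bounds(M, K, k_d, sigma2, kappa=0.5):
    """Return (Jensen lower bound, simplified lower bound, upper bound)
    on the expected value of lambda_bar for one pair of supports."""
    mean_psi = sum(digamma(kappa * (M - K - n + 1)) for n in range(1, k_d + 1)) / k_d
    jensen = 1.0 + math.exp(mean_psi) / (kappa * sigma2)
    simple = 1.0 + (M - K - k_d) / sigma2
    upper = 1.0 + M / sigma2
    return jensen, simple, upper

if __name__ == "__main__":
    print(expected_lambda_bounds(20, 3, 2, 1.0, kappa=0.5))  # illustrative values
    print(expected_lambda_bounds(20, 3, 2, 1.0, kappa=1.0))
```

The Jensen bound sits between the simplified lower bound and the trivial upper bound, and all three grow linearly with $M$ for fixed $K$ and $\sigma^2$.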

B. Necessary Condition 

One fundamental problem in Compressive Sensing is how many samples the system should take to guarantee a stable reconstruction. Although many sufficient conditions are available, non-trivial necessary conditions are rare. Besides, in previous works, stable reconstruction has been measured in the sense of $\ell_p$ norms between the reconstructed signal and the true signal. In this section, we derive two necessary conditions on $M$ and $T$ in terms of $N$ and $K$ in order to guarantee respectively that, first, $EP_{\mathrm{err}}$ tends to zero and, second, for the majority of realizations of $A$, the probability of error vanishes. More precisely, we have the following theorem:



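Theorem 4: In the support recovery problem (5), for any $\varepsilon, \delta > 0$, a necessary condition of $EP_{\mathrm{err}} < \varepsilon$ is

$$
MT \ge (1-\varepsilon)\frac{2\sigma^2\log\binom{N}{K}}{\kappa K\left(1 - \frac{K}{N}\right)} + o(1) \tag{41}
$$

$$
\phantom{MT} \ge (1-\varepsilon)\frac{2\sigma^2}{\kappa}\log\frac{N}{K} + o(1), \tag{42}
$$

and a necessary condition of $P\{P_{\mathrm{err}}(A) \le \varepsilon\} \ge 1 - \delta$ is

$$
MT \ge (1-\varepsilon-\delta)\frac{2\sigma^2\log\binom{N}{K}}{\kappa K\left(1 - \frac{K}{N}\right)} + o(1) \tag{43}
$$

$$
\phantom{MT} \ge (1-\varepsilon-\delta)\frac{2\sigma^2}{\kappa}\log\frac{N}{K} + o(1). \tag{44}
$$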



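Proof: Equation (35) and $E\|A\|_F^2 = \sum_{m,n}E|A_{mn}|^2 = MN$ give

$$
EP_{\mathrm{err}} \ge 1 - \frac{\kappa T\,\dfrac{K}{2\sigma^2 N}\left(1 - \dfrac{K}{N}\right)MN}{\log\binom{N}{K}} + o(1). \tag{45}
$$

Hence, $EP_{\mathrm{err}} < \varepsilon$ entails (41) and (42).

Denote by $E$ the event $\{A : P_{\mathrm{err}}(A) \le \varepsilon\}$; then $P\{E^c\} \le \delta$ and we have

$$
EP_{\mathrm{err}} = \int_{E} P_{\mathrm{err}}(A) + \int_{E^c} P_{\mathrm{err}}(A) \le \varepsilon P(E) + P(E^c) \le \varepsilon + \delta.
$$

Therefore, from the first part of the theorem, we obtain (43) and (44). ■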

We compare our results with those of [43] and [42]. As we mentioned in the introduction, the differences in problem modeling and the definition of the probability of error make a direct comparison difficult. We first note that Theorem 2 in [43] is established for the restricted problem where it is known a priori that all non-zero components in the sparse signal are equal. Because the set of signal realizations with equal non-zero components is a rare event in our signal model, it is not fitting to compare our result with the corresponding one in [43] by computing the distribution of the smallest on-support element, e.g., the expectation. Actually, the square of the smallest on-support element for the restricted problem, $M^2(\beta^*)$ (or $\beta$ in [42]), is equivalent to the signal variance in our model: both are measures of the signal energy. If we take into account the noise variance and replace $M^2(\beta^*)$ (or $\beta^2\,\mathrm{SNR}$ in [42]) with $1/\sigma^2$, the necessary conditions in these papers coincide with ours when only one temporal sample is available.

Our result shows that as far as support recovery is concerned,
one cannot avoid the $\log\frac{N}{K}$ term when only given
one temporal sample. Worse, for conventional Compressive






Sensing with a measurement matrix generated from a Gaussian random variable with variance $1/M$, the necessary condition becomes

$$
T \ge \frac{2\sigma^2\log\binom{N}{K}}{\kappa K\left(1 - \frac{K}{N}\right)} + o(1) \ge \frac{2\sigma^2}{\kappa}\log\frac{N}{K} + o(1),
$$

which is independent of $M$. Therefore, when there is considerable noise ($\sigma^2 > \kappa/(2\log\frac{N}{K})$), it is impossible to have a vanishing $EP_{\mathrm{err}}$ no matter how large an $M$ one takes. Basically this situation arises because while taking more samples, one scales down the measurement gains $A_{ml}$, which effectively reduces the SNR and thus is not helpful in support recovery. As discussed below Theorem 3, $\log\binom{N}{K}$ is the uncertainty of the support variable $S$, and $\log\frac{N}{K}$ actually comes from it. Therefore, it is no surprise that the number of samples is determined by this quantity and cannot be made independent of it.



C. Sufficient Condition 

We derive a sufficient condition in parallel with sufficient conditions in Compressive Sensing. In Compressive Sensing, when only one temporal sample is available, $M = \Omega\left(K\log\frac{N}{K}\right)$ is enough for stable signal reconstruction for the majority of the realizations of measurement matrix $A$ from a Gaussian ensemble with variance $\frac{1}{M}$. As shown in the previous subsection, if we take the probability of error for support recovery as a performance measure, it is impossible in this case to recover the support with a vanishing probability of error unless the noise is small. Therefore, we consider a Gaussian ensemble with unit variance. We first establish a lemma to estimate the lower tail of the distribution for $\bar{\lambda}_{S_i,S_j}$. We have shown that $E(\bar{\lambda}_{S_i,S_j})$ lies between $1 + \frac{M-2K}{\sigma^2}$ and $1 + \frac{M}{\sigma^2}$. When $\gamma$ is much less than $1 + \frac{M-2K}{\sigma^2}$, we expect that $P\{\bar{\lambda}_{S_i,S_j} \le \gamma\}$ decays quickly. More specifically, we have the following large deviation lemma:

Lemma 1: Suppose that $\gamma = \frac{1}{3}\frac{M-2K}{\sigma^2}$. Then there exists a constant $c > 0$ such that for $M - 2K$ sufficiently large, we have

$$
P\{\bar{\lambda}_{S_i,S_j} \le \gamma\} \le \exp\left[-c(M-2K)\right].
$$

This large deviation lemma together with the union bound 
yield the following sufficient condition for support recovery: 



Theorem 5: Suppose that

$$
M = \Omega\left(K\log\frac{N}{K}\right) \tag{46}
$$

and

$$
\kappa T\log\frac{M}{\sigma^2} \gg \log\left[K(N-K)\right]. \tag{47}
$$

Then given any realization of measurement matrix $A$ from a Gaussian ensemble, the optimal decision rule for multiple support recovery problem (5) has a vanishing $P_{\mathrm{err}}$ with probability tending to one. In particular, if $M = \Omega\left(K\log\frac{N}{K}\right)$ and

$$
T \gg \frac{\log N}{\log\log N}, \tag{48}
$$

then the probability of error tends to zero as $N$ tends to infinity.



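Proof: Denote $\gamma = \frac{1}{3}\frac{M-2K}{\sigma^2}$. Then according to the union bound, we have

$$
P\{\bar{\lambda} \le \gamma\} = P\Big\{\bigcup_{S_i \ne S_j}\left[\bar{\lambda}_{S_i,S_j} \le \gamma\right]\Big\} \le \sum_{S_i \ne S_j} P\{\bar{\lambda}_{S_i,S_j} \le \gamma\}.
$$

Therefore, application of Lemma 1 gives

$$
P\{\bar{\lambda} \le \gamma\} \le \exp\left\{-c(M-2K) + 2K\log\frac{N}{K} + \log K\right\}.
$$

Hence, as long as $M = \Omega\left(K\log\frac{N}{K}\right)$, we know that the exponent tends to $-\infty$ as $N \to \infty$. We now define $E = \{A : \bar{\lambda}(A) > \gamma\}$, where $P\{E\}$ approaches one as $N$ tends to infinity. Now the upper bound (26) becomes

$$
P_{\mathrm{err}} = O\left(\frac{K(N-K)}{(\gamma/4)^{\kappa T}}\right) = O\left(\frac{K(N-K)}{(M/\sigma^2)^{\kappa T}}\right).
$$

Hence, if $\kappa T\log\frac{M}{\sigma^2} \gg \log\left[K(N-K)\right]$, we get a vanishing probability of error. In particular, under the assumption that $M = \Omega\left(K\log\frac{N}{K}\right)$, if $T \gg \frac{\log N}{\log\log N}$, then $\kappa T \gg \frac{\log[K(N-K)]}{\log\left[K\log\frac{N}{K}\right]}$, which implies that $K(N-K) \ll \left(K\log\frac{N}{K}\right)^{\kappa T}$ and hence a vanishing probability of error, for suitably selected constants.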



We now consider several special cases and explore the 
implications of the sufficient conditions. The discussions are 
heuristic in nature and their validity requires further checking. 

If we set T = 1, then we need M to be much greater than 
N to guarantee a vanishing probability P err . This restriction 
suggests that even if we have more observations than the 
original signal length N, in which case we can obtain the 
original sparse signal by solving a least squares problem, we 
still might not be able to get the correct support because of the 
noise, as long as M is not sufficiently large compared to N. 
We discussed in the introduction that for many applications, 
the support of a signal has significant physical implications 
and its correct recovery is of crucial importance. Therefore, 
without multiple temporal samples and with moderate noise, 
the scheme proposed by Compressive Sensing is questionable
as far as support recovery is concerned. Worse, if we set the
variance for the elements in A to be 1/M as in Compressive
Sensing, which is equivalent to replacing $\sigma^2$ with $M\sigma^2$, even
increasing the number of temporal samples will not improve 
the probability of error significantly unless the noise variance 
is very small. Hence, using support recovery as a criterion, 
one cannot expect the Compressive Sensing scheme to work 
very well in the low SNR case. This conclusion is not a 
surprise, since we reduce the number of samples to achieve 
compression. 

Another special case is when $K = 1$. In this case, the sufficient condition becomes $M \gg \log N$ and $\kappa T\log\frac{M}{\sigma^2} \gg \log N$. Now the number of total samples should satisfy $MT \gg \frac{(\log N)^2}{\log\log N}$, while the necessary condition states that $MT = \Omega(\log N)$. The smallest gap between the necessary condition and sufficient condition is achieved when $K = 1$.

From a denoising perspective, Fletcher et al. [61] upper bounded and approximated the probability of error for support recovery averaged over the Gaussian ensemble. The bound and its approximation are applicable only to the special case with $K = 1$ and involve complex integrals. The authors obtained an interesting SNR threshold as a function of $M$, $N$ and $K$ through the analytical bound. Note that our bounds are valid for general $K$ and have a simple form. Besides, most of our derivation is conditioned on a realization of the Gaussian measurement ensemble. The conditioning makes more sense than averaging since in practice we usually make observations with a fixed sensing matrix and varying signals and noise.

The result of Theorem 5 also exhibits several interesting properties in the general case. Compared with the necessary conditions (43) and (44), the asymmetry in the sufficient condition is even more desirable in most cases because of the asymmetric cost associated with sensors and temporal samples. Once the threshold $K\log\frac{N}{K}$ of $M$ is exceeded, we can achieve a desired probability of error by taking more temporal samples. If we were concerned only with the total number of samples, we would minimize $MT$ subject to the constraints (46) and (47) to achieve a given level of probability of error. However, in applications for which timing is important, one has to increase sensors to reduce $P_{\mathrm{err}}$ to a certain limit.

The sufficient conditions (46), (47), and (48) are separable in the following sense. We observe from the proof that the requirement $M = \Omega\left(K\log\frac{N}{K}\right)$ is used only to guarantee that the randomly generated measurement matrix is a good one in the sense that its incoherence $\bar{\lambda}$ is sufficiently large, as in the case of Compressive Sensing. It is in Lemma 1 that we use the Gaussian ensemble assumption. If another deterministic construction procedure (for attempts in this direction, see [62]) or random distribution gives a measurement matrix with better incoherence $\bar{\lambda}$, it would be possible to reduce the orders for both $M$ and $T$.



VI. Conclusions 

In this paper, we formulated the support recovery problems
for jointly sparse signals as binary and multiple-hypothesis
testing problems. Adopting the probability of error as the performance
criterion, the optimal decision rules are given by the likelihood
ratio test and the maximum a posteriori probability estimator.
The latter reduces to the maximum likelihood estimator when
equal prior probabilities are assigned to the supports. We
then employed the Chernoff bound and Fano's inequality to
derive bounds on the probability of error. We discussed the
implications of these bounds at the end of Section III-B,
Section III-C, Section IV, Section V-B, and Section V-C,
in particular when they are applied to the DOA estimation
problem considered in [7] and Compressive Sensing with a
Gaussian measurement ensemble. We derived sufficient and
necessary conditions for Compressive Sensing using Gaussian
measurement matrices to achieve a vanishing probability of
error in both the mean and large probability senses. These
conditions show the necessity of considering multiple temporal
samples. The symmetric and asymmetric roles played by
the spatial and temporal samples and their implications in
system design were discussed. For Compressive Sensing, we
demonstrated that it is impossible to obtain accurate signal
support with only one temporal sample if the variance for the
Gaussian measurement matrix scales with 1/M and there is
considerable noise.

This research on support recovery for jointly sparse signals
is far from complete. Several questions remain to be answered.
First, we notice an obvious gap between the necessary and
sufficient conditions even in the simplest case with K = 1.
Better techniques need to be introduced to refine the results.
Second, as in the case for RIP, computation of the
quantity $\bar{\lambda}$ for an arbitrary measurement matrix is extremely
difficult. Although we derive large deviation bounds on $\bar{\lambda}$
and compute the expected value of $\bar{\lambda}_{S_i,S_j}$ for the Gaussian
ensemble, its behavior in both the general and Gaussian cases
requires further study. Its relationship with RIP also needs to
be clarified. Finally, our lower bound derived from Fano's
inequality identifies only the effect of the total gain. The
effect of the measurement matrix's incoherence is elusive. The
answers to these questions will enhance our understanding of
the measurement mechanism (2).



Appendix A 
Proof of Proposition Q]

In this proof, we focus on the case for which both $k_0 \neq 0$
and $k_1 \neq 0$. Other cases have similar and simpler proofs. The
eigenvalues of H satisfy $|\lambda I_M - H| = 0$, which is equivalent
to $|\lambda \Sigma_1 - \Sigma_0| = 0$. The substitution $\lambda = \mu + 1$ defines

$$g(\mu) = |(\mu + 1)\Sigma_1 - \Sigma_0| = |\mu \Sigma_1 - (\Sigma_0 - \Sigma_1)|.$$

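The following algebraic manipulation

$$
\begin{aligned}
G \triangleq \Sigma_0 - \Sigma_1
&= \left(A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger + A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger\right)
 - \left(A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger + A_{S_1 \setminus S_0} A_{S_1 \setminus S_0}^\dagger\right) \\
&= A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger - A_{S_1 \setminus S_0} A_{S_1 \setminus S_0}^\dagger
\end{aligned}
$$

leads to

$$g(\mu) = \left|\Sigma_1^{1/2}\right| \, \left|\mu I_M - \Sigma_1^{-1/2} G \Sigma_1^{-1/2}\right| \, \left|\Sigma_1^{1/2}\right|.$$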
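The reduction above can be checked numerically. The following is a minimal sketch, assuming the covariance form $\Sigma_i = \sigma^2 I_M + A_{S_i} A_{S_i}^\dagger$ that appears in Appendix B, a real Gaussian matrix, and illustrative sizes chosen here (none of the variable names come from the paper). It confirms that the shared columns cancel in $\Sigma_0 - \Sigma_1$ and that the roots of $|\lambda \Sigma_1 - \Sigma_0| = 0$ are the eigenvalues of $\Sigma_1^{-1/2} G \Sigma_1^{-1/2}$ shifted by one.

```python
import numpy as np

rng = np.random.default_rng(0)
M, k0, ki, k1, sigma2 = 8, 2, 1, 3, 0.5   # illustrative sizes (assumed)

# Column blocks of a generic real measurement matrix, split by support region.
A_0only = rng.standard_normal((M, k0))    # columns indexed by S0 \ S1
A_both = rng.standard_normal((M, ki))     # columns indexed by S0 ∩ S1
A_1only = rng.standard_normal((M, k1))    # columns indexed by S1 \ S0

A_S0 = np.hstack([A_both, A_0only])
A_S1 = np.hstack([A_both, A_1only])
Sigma0 = sigma2 * np.eye(M) + A_S0 @ A_S0.T
Sigma1 = sigma2 * np.eye(M) + A_S1 @ A_S1.T

# G = Sigma0 - Sigma1: the contribution of the shared columns cancels.
G = A_0only @ A_0only.T - A_1only @ A_1only.T

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T

S1_inv_half = inv_sqrt(Sigma1)
mu = np.linalg.eigvalsh(S1_inv_half @ G @ S1_inv_half)   # roots of g(mu)
# Roots of |lam * Sigma1 - Sigma0| = 0, computed independently.
lam = np.sort(np.linalg.eigvals(np.linalg.solve(Sigma1, Sigma0)).real)

print(np.allclose(Sigma0 - Sigma1, G), np.allclose(lam, mu + 1.0))
```

Both comparisons should hold up to floating-point error, which is the substitution $\lambda = \mu + 1$ in numerical form.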



Therefore, to prove the theorem, it suffices to show that
$\Sigma_1^{-1/2} G \Sigma_1^{-1/2}$ has $k_0$ positive eigenvalues, $k_1$ negative
eigenvalues and $M - (k_0 + k_1)$ zero eigenvalues or, put another
way, $\Sigma_1^{-1/2} G \Sigma_1^{-1/2}$ has inertia $(k_0, k_1, M - (k_0 + k_1))$.
Sylvester's law of inertia ([49], Theorem 4.5.8, p. 223)
states that the inertia of a symmetric matrix is invariant under
congruence transformations. Hence, we need only to show that
G has inertia $(k_0, k_1, M - (k_0 + k_1))$. Clearly $G = PQ^\dagger$ with

$$P = \begin{bmatrix} A_{S_0 \setminus S_1} & A_{S_1 \setminus S_0} \end{bmatrix}
\quad \text{and} \quad
Q = \begin{bmatrix} A_{S_0 \setminus S_1} & -A_{S_1 \setminus S_0} \end{bmatrix}.$$



To find the number of zero eigenvalues of G, we calculate the
rank of G. The non-degenerateness of measurement matrix A
implies that $\mathrm{rank}(P) = \mathrm{rank}(Q) = k_0 + k_1$. Therefore, from
the rank inequality ([49], Theorem 0.4.5, p. 13),

$$\mathrm{rank}(P) + \mathrm{rank}(Q^\dagger) - (k_0 + k_1) \le \mathrm{rank}(PQ^\dagger) \le \min\{\mathrm{rank}(P), \mathrm{rank}(Q^\dagger)\},$$

we conclude that $\mathrm{rank}(G) = k_0 + k_1$.

To count the number of negative eigenvalues of G, we use
the Jacobi-Sturm rule ([63], Theorem A.1.4, p. 320), which
states that for an M x M symmetric matrix whose jth leading
principal minor has determinant $d_j$, j = 1, ..., M, the number
of negative eigenvalues is equal to the number of sign
changes of the sequence $\{1, d_1, \ldots, d_M\}$. We consider only the
first $k_0 + k_1$ leading principal minors, since higher order minors
have determinant 0.
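As an illustration of the sign-change rule just quoted, here is a minimal sketch that counts sign changes in the sequence of leading principal minors and compares the count with the number of negative eigenvalues. It assumes all leading minors are nonzero, which holds for the generic random matrix used; the function name is hypothetical and chosen only for this example.

```python
import numpy as np

def sign_changes_of_leading_minors(S):
    """Count sign changes in 1, d_1, ..., d_n, where d_j is the determinant
    of the j-th leading principal submatrix (all d_j assumed nonzero)."""
    n = S.shape[0]
    seq = [1.0] + [np.linalg.det(S[:j, :j]) for j in range(1, n + 1)]
    return sum(1 for a, b in zip(seq, seq[1:]) if a * b < 0)

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
S = B + B.T                                        # generic symmetric matrix
neg = int(np.sum(np.linalg.eigvalsh(S) < 0))       # number of negative eigenvalues

print(sign_changes_of_leading_minors(S), neg)      # the two counts agree
```

For the singular matrix G in the proof, the rule is applied only to the first $k_0 + k_1$ leading minors, as the text notes.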

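Suppose $I = \{1, \ldots, k_0 + k_1\}$ is an index set. Without loss
of generality, we assume that $P^I$ is nonsingular. Applying QL
factorization (one variation of QR factorization, see [64]) to
matrix $P^I$, we obtain $P^I = OL$, where O is an orthogonal
matrix, $OO^\dagger = I_{k_0 + k_1}$, and $L = (l_{ij})_{(k_0 + k_1) \times (k_0 + k_1)}$ is a
lower triangular matrix. The diagonal entries of L are nonzero
since $P^I$ is nonsingular. The partition of L into

$$L = \begin{bmatrix} L_1 & L_2 \end{bmatrix}$$

with $L_1 \in \mathbb{F}^{(k_0 + k_1) \times k_0}$, $L_2 \in \mathbb{F}^{(k_0 + k_1) \times k_1}$, and
$L_2 = \begin{bmatrix} 0 \\ L_3 \end{bmatrix}$ with $L_3 \in \mathbb{F}^{k_1 \times k_1}$ implies

$$G_I^I = P^I (Q^I)^\dagger = O \begin{bmatrix} L_1 & L_2 \end{bmatrix}
\begin{bmatrix} L_1^\dagger \\ -L_2^\dagger \end{bmatrix} O^\dagger.$$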



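Again using the invariance property of inertia under congruence
transformation, we focus on the leading principal minors
of $U = \begin{bmatrix} L_1 & L_2 \end{bmatrix} \begin{bmatrix} L_1^\dagger \\ -L_2^\dagger \end{bmatrix}$.
Suppose $J = \{1, \ldots, j\}$. For $1 \le j \le k_0$, from the lower
triangularity of L, it is clear that

$$\left|U_J^J\right| = \left|(L_1)_J^J\right|^2 = \prod_{i=1}^{j} |l_{ii}|^2 > 0.$$

For $k_0 + 1 \le j \le k_0 + k_1$, suppose $J_0 = \{1, \ldots, k_0\}$ and
$J_1 = \{1, \ldots, j - k_0\}$. We then have

$$\left|U_J^J\right| = (-1)^{j - k_0} \left|(L_1)_{J_0}^{J_0}\right|^2 \left|(L_3)_{J_1}^{J_1}\right|^2
= (-1)^{j - k_0} \prod_{i=1}^{j} |l_{ii}|^2.$$

Therefore, the sequence $1, d_1, d_2, \ldots, d_{k_0 + k_1}$ has $k_1$ sign
changes, which implies that $G_I^I$, and hence G, has $k_1$ negative
eigenvalues. Finally, we conclude that the theorem holds for
H.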
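The inertia claim for $G = PQ^\dagger$ can be checked on a random instance. This is a minimal sketch, assuming a real Gaussian matrix (which is non-degenerate with probability one) and illustrative dimensions; the helper `inertia` and the tolerance are choices made here, not part of the paper.

```python
import numpy as np

def inertia(S, tol=1e-9):
    """Return (#positive, #negative, #zero) eigenvalues of a symmetric matrix."""
    w = np.linalg.eigvalsh(S)
    return (int(np.sum(w > tol)), int(np.sum(w < -tol)),
            int(np.sum(np.abs(w) <= tol)))

rng = np.random.default_rng(2)
M, k0, k1 = 9, 3, 2                      # illustrative sizes (assumed)
A_0only = rng.standard_normal((M, k0))   # columns indexed by S0 \ S1
A_1only = rng.standard_normal((M, k1))   # columns indexed by S1 \ S0

P = np.hstack([A_0only, A_1only])
Q = np.hstack([A_0only, -A_1only])
G = P @ Q.T                              # equals A_0only A_0only^T - A_1only A_1only^T

print(np.linalg.matrix_rank(G), inertia(G))
```

The rank should equal $k_0 + k_1$ and the inertia should be $(k_0, k_1, M - (k_0 + k_1))$, matching the two counting steps of the proof.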



Appendix B 
Proof of Proposition 3

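We first prove the first claim. From the proof in Appendix A,
it suffices to show that the sorted positive eigenvalues of
$\Sigma_1^{-1/2} G \Sigma_1^{-1/2}$ are greater than those of $\frac{1}{\sigma^2} R_{33} R_{33}^\dagger$, where
$G = A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger - A_{S_1 \setminus S_0} A_{S_1 \setminus S_0}^\dagger$.
Since cyclic permutation of
a matrix product does not change its eigenvalues, we restrict
ourselves to $\Sigma_1^{-1} G$. Consider the QR decomposition

$$\begin{bmatrix} A_{S_1 \setminus S_0} & A_{S_0 \cap S_1} & A_{S_0 \setminus S_1} \end{bmatrix}
= QR
= \begin{bmatrix} Q_1 & Q_2 & Q_3 & Q_4 \end{bmatrix}
\begin{bmatrix} R_{11} & R_{12} & R_{13} \\ 0 & R_{22} & R_{23} \\ 0 & 0 & R_{33} \\ 0 & 0 & 0 \end{bmatrix},$$

where $Q \in \mathbb{F}^{M \times M}$ is an orthogonal matrix with partitions
$Q_1 \in \mathbb{F}^{M \times k_1}$, $Q_2 \in \mathbb{F}^{M \times k_i}$, $Q_3 \in \mathbb{F}^{M \times k_0}$,
$R \in \mathbb{F}^{M \times (k_1 + k_i + k_0)}$ is an upper triangular matrix with partitions
$R_{11} \in \mathbb{F}^{k_1 \times k_1}$, $R_{22} \in \mathbb{F}^{k_i \times k_i}$, $R_{33} \in \mathbb{F}^{k_0 \times k_0}$, and other
submatrices have corresponding dimensions.
First, we note that

$$
\begin{aligned}
Q^\dagger G Q
&= \begin{bmatrix} R_{13} \\ R_{23} \\ R_{33} \\ 0 \end{bmatrix}
   \begin{bmatrix} R_{13}^\dagger & R_{23}^\dagger & R_{33}^\dagger & 0 \end{bmatrix}
 - \begin{bmatrix} R_{11} \\ 0 \\ 0 \\ 0 \end{bmatrix}
   \begin{bmatrix} R_{11}^\dagger & 0 & 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix}
R_{13} R_{13}^\dagger - R_{11} R_{11}^\dagger & R_{13} R_{23}^\dagger & R_{13} R_{33}^\dagger & 0 \\
R_{23} R_{13}^\dagger & R_{23} R_{23}^\dagger & R_{23} R_{33}^\dagger & 0 \\
R_{33} R_{13}^\dagger & R_{33} R_{23}^\dagger & R_{33} R_{33}^\dagger & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}.
\end{aligned}
$$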




Therefore, the last $M - (k_1 + k_i + k_0)$ rows and columns
of $Q^\dagger G Q$, and hence of $(Q^\dagger \Sigma_1 Q)^{-1} (Q^\dagger G Q)$, are zeros,
which lead to the $M - (k_1 + k_i + k_0)$ zero eigenvalues of
$\Sigma_1^{-1/2} G \Sigma_1^{-1/2}$. We then drop these rows and columns in all
matrices involved in subsequent analysis. In particular, the
submatrix of $Q^\dagger \Sigma_1 Q = Q^\dagger (\sigma^2 I_M + A_{S_1} A_{S_1}^\dagger) Q$ without the
last $M - (k_1 + k_i + k_0)$ rows and columns is

$$\sigma^2 I_{k_1 + k_i + k_0}
+ \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \\ 0 & 0 \end{bmatrix}
  \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \\ 0 & 0 \end{bmatrix}^\dagger.$$



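Define

$$
\begin{aligned}
F &\triangleq \sigma^2 I_{k_1 + k_i}
  + \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}
    \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}^\dagger, \\
V &\triangleq \begin{bmatrix} R_{13} \\ R_{23} \end{bmatrix}
              \begin{bmatrix} R_{13}^\dagger & R_{23}^\dagger \end{bmatrix}
  - \begin{bmatrix} R_{11} \\ 0 \end{bmatrix}
    \begin{bmatrix} R_{11}^\dagger & 0 \end{bmatrix}, \\
K &\triangleq \begin{bmatrix} R_{33} R_{13}^\dagger & R_{33} R_{23}^\dagger \end{bmatrix},
\end{aligned}
$$

so that the truncated versions of $Q^\dagger \Sigma_1 Q$ and $Q^\dagger G Q$ are

$$\begin{bmatrix} F & 0 \\ 0 & \sigma^2 I_{k_0} \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} V & K^\dagger \\ K & R_{33} R_{33}^\dagger \end{bmatrix},$$

respectively.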



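Due to the invariance of eigenvalues with respect to orthogonal
transformations and switching to the symmetrized version, we
focus on

$$
\begin{bmatrix} F & 0 \\ 0 & \sigma^2 I_{k_0} \end{bmatrix}^{-1/2}
\begin{bmatrix} V & K^\dagger \\ K & R_{33} R_{33}^\dagger \end{bmatrix}
\begin{bmatrix} F & 0 \\ 0 & \sigma^2 I_{k_0} \end{bmatrix}^{-1/2}
=
\begin{bmatrix}
F^{-1/2} V F^{-1/2} & \frac{1}{\sigma} F^{-1/2} K^\dagger \\
\frac{1}{\sigma} K F^{-1/2} & \frac{1}{\sigma^2} R_{33} R_{33}^\dagger
\end{bmatrix}.
$$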



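Next we argue that the sorted positive eigenvalues of

$$\begin{bmatrix}
F^{-1/2} V F^{-1/2} & \frac{1}{\sigma} F^{-1/2} K^\dagger \\
\frac{1}{\sigma} K F^{-1/2} & \frac{1}{\sigma^2} R_{33} R_{33}^\dagger
\end{bmatrix}$$

are greater than the corresponding sorted eigenvalues of $\frac{1}{\sigma^2} R_{33} R_{33}^\dagger$.

For any $\varepsilon > 0$, we define a matrix

$$M_{\varepsilon, N} = \begin{bmatrix}
-N I_{k_1 + k_i} & 0 \\
0 & \frac{1}{\sigma^2} R_{33} R_{33}^\dagger - \varepsilon I_{k_0}
\end{bmatrix}.$$

Then we have

$$\begin{bmatrix}
F^{-1/2} V F^{-1/2} & \frac{1}{\sigma} F^{-1/2} K^\dagger \\
\frac{1}{\sigma} K F^{-1/2} & \frac{1}{\sigma^2} R_{33} R_{33}^\dagger
\end{bmatrix} - M_{\varepsilon, N}
= \begin{bmatrix}
F^{-1/2} V F^{-1/2} + N I_{k_1 + k_i} & \frac{1}{\sigma} F^{-1/2} K^\dagger \\
\frac{1}{\sigma} K F^{-1/2} & \varepsilon I_{k_0}
\end{bmatrix}.$$

Note that the right-hand side is congruent to

$$\begin{bmatrix}
F^{-1/2} V F^{-1/2} + N I_{k_1 + k_i} - \frac{1}{\varepsilon \sigma^2} F^{-1/2} K^\dagger K F^{-1/2} & 0 \\
0 & \varepsilon I_{k_0}
\end{bmatrix}.$$

Clearly $F^{-1/2} V F^{-1/2} + N I_{k_1 + k_i} - \frac{1}{\varepsilon \sigma^2} F^{-1/2} K^\dagger K F^{-1/2}$ is positive
definite when N is sufficiently large. Hence, when N is
large enough, we obtain

$$\begin{bmatrix}
F^{-1/2} V F^{-1/2} & \frac{1}{\sigma} F^{-1/2} K^\dagger \\
\frac{1}{\sigma} K F^{-1/2} & \frac{1}{\sigma^2} R_{33} R_{33}^\dagger
\end{bmatrix} \succ M_{\varepsilon, N}.$$

Using Corollary 4.3.3 of [49], we conclude that the eigenvalues
of the matrix on the left-hand side are greater than those
of $M_{\varepsilon, N}$ if sorted. From the proposition proved in Appendix A
we know that this matrix has exactly $k_0$ positive
eigenvalues, which are the only eigenvalues that could
be greater than $\lambda\left(\frac{1}{\sigma^2} R_{33} R_{33}^\dagger\right) - \varepsilon$. Since $\varepsilon$ is arbitrary, we
finally conclude that the positive eigenvalues of $\Sigma_1^{-1} G$ are
greater than those of $\frac{1}{\sigma^2} R_{33} R_{33}^\dagger$ if sorted in the same way.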
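The first claim can be illustrated numerically. The sketch below is a rough check under assumptions made here: a real Gaussian matrix, illustrative block sizes, and the column ordering $[A_{S_1 \setminus S_0}\; A_{S_0 \cap S_1}\; A_{S_0 \setminus S_1}]$ for the QR step, so that $R_{33}$ is the trailing $k_0 \times k_0$ block of R. The eigenvalues of $\Sigma_1^{-1} G$ are computed through a Cholesky factor of $\Sigma_1$, which is a numerical convenience and not a step of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
M, k1, ki, k0, sigma2 = 10, 2, 2, 3, 0.7   # illustrative sizes (assumed)

A_1only = rng.standard_normal((M, k1))     # columns indexed by S1 \ S0
A_both = rng.standard_normal((M, ki))      # columns indexed by S0 ∩ S1
A_0only = rng.standard_normal((M, k0))     # columns indexed by S0 \ S1

# Reduced QR of the stacked blocks; R33 is the trailing k0 x k0 block of R.
_, R = np.linalg.qr(np.hstack([A_1only, A_both, A_0only]))
R33 = R[k1 + ki:, k1 + ki:]

A_S1 = np.hstack([A_1only, A_both])
Sigma1 = sigma2 * np.eye(M) + A_S1 @ A_S1.T
G = A_0only @ A_0only.T - A_1only @ A_1only.T

# Eigenvalues of Sigma1^{-1} G via the similar symmetric matrix L^{-1} G L^{-T}.
L = np.linalg.cholesky(Sigma1)
X = np.linalg.solve(L, G)                  # L^{-1} G
W = np.linalg.solve(L, X.T).T              # L^{-1} G L^{-T}
eig = np.linalg.eigvalsh((W + W.T) / 2)    # ascending order

positive = eig[-k0:]                                  # the k0 positive eigenvalues
lower = np.linalg.eigvalsh(R33 @ R33.T) / sigma2      # ascending order

print(np.all(positive >= lower - 1e-9))
```

On such an instance each sorted positive eigenvalue should dominate the corresponding eigenvalue of $\frac{1}{\sigma^2} R_{33} R_{33}^\dagger$.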

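For the second claim, we need some notations and properties
of symmetric and Hermitian matrices. For any pair of
symmetric (or Hermitian) matrices P and Q, $P \prec Q$ means
that $Q - P$ is positive definite and $P \preceq Q$ means $Q - P$
is nonnegative definite. Note that if P and Q are positive
definite, then from Corollary 7.7.4 of [49] $P \preceq Q$ if and
only if $Q^{-1} \preceq P^{-1}$; if $P \preceq Q$ then the eigenvalues of P
and Q satisfy $\lambda_k(P) \le \lambda_k(Q)$, where $\lambda_k(P)$ denotes the
kth largest eigenvalue of P; furthermore, $A \preceq B$ implies that
$PAP^\dagger \preceq PBP^\dagger$ for any P, square or rectangular. Therefore,
$\sigma^2 I_M + A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger \preceq \sigma^2 I_M + A_{S_1} A_{S_1}^\dagger = \Sigma_1$ yields

$$H = \Sigma_0^{1/2} \Sigma_1^{-1} \Sigma_0^{1/2}
\preceq \Sigma_0^{1/2} \left(\sigma^2 I_M + A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger\right)^{-1} \Sigma_0^{1/2}.$$

Recall that from the definition of eigenvalues, the non-zero
eigenvalues of AB and BA are the same for any matrices A
and B. Since we are interested only in the eigenvalues, a cyclic
permutation in the matrix product on the previous inequality's
right-hand side gives us

$$
\begin{aligned}
&\left(\sigma^2 I_M + A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger\right)^{-1/2} \Sigma_0
 \left(\sigma^2 I_M + A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger\right)^{-1/2} \\
&\quad = I_M + \left(\sigma^2 I_M + A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger\right)^{-1/2}
   A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger
   \left(\sigma^2 I_M + A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger\right)^{-1/2} \\
&\quad = I_M + Q^{-1/2} A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger Q^{-1/2}
 \triangleq I_M + P,
\end{aligned}
$$

where $Q \triangleq \sigma^2 I_M + A_{S_0 \cap S_1} A_{S_0 \cap S_1}^\dagger$.

Until now we have shown that the sorted eigenvalues of H
are less than the corresponding ones of $I_M + P$. The non-zero
eigenvalues of $Q^{-1/2} A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger Q^{-1/2}$ are the same as the
non-zero eigenvalues of $A_{S_0 \setminus S_1}^\dagger Q^{-1} A_{S_0 \setminus S_1} \preceq \frac{1}{\sigma^2} A_{S_0 \setminus S_1}^\dagger A_{S_0 \setminus S_1}$.
Using the same fact again, we conclude that the non-zero
eigenvalues of $\frac{1}{\sigma^2} A_{S_0 \setminus S_1}^\dagger A_{S_0 \setminus S_1}$ are the same as the non-zero
eigenvalues of $\frac{1}{\sigma^2} A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger$. Therefore, we obtain that

$$\lambda_k(H) \le \lambda_k(I_M + P) \le \lambda_k\left(I_M + \frac{1}{\sigma^2} A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger\right).$$

In particular, the eigenvalues of H that are greater than
1 are upper bounded by the corresponding ones of
$I_M + \frac{1}{\sigma^2} A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger$ if they are both sorted in ascending order. Hence,
we get that the eigenvalues of H that are greater than 1 are
less than those of $I_{k_0} + \frac{1}{\sigma^2} A_{S_0 \setminus S_1}^\dagger A_{S_0 \setminus S_1}$.

Therefore, the conclusion of the second part of the theorem
holds. We comment here that usually it is not true that
$H \preceq I_M + \frac{1}{\sigma^2} A_{S_0 \setminus S_1} A_{S_0 \setminus S_1}^\dagger$. Only the inequality on eigenvalues
holds. ■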
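The upper bound of the second claim, and the fact about the non-zero eigenvalues of AB and BA used twice in its proof, can be illustrated as follows. This is a minimal sketch on a random real instance with sizes chosen here for illustration. It relies only on the relation, stated in Appendix A, that the eigenvalues of H are the roots of $|\lambda \Sigma_1 - \Sigma_0| = 0$; the helper `generalized_eigs` is a hypothetical name.

```python
import numpy as np

def generalized_eigs(S0, S1):
    """Ascending roots of |lam * S1 - S0| = 0 for symmetric S0 and positive
    definite S1, computed from the similar matrix L^{-1} S0 L^{-T}."""
    L = np.linalg.cholesky(S1)
    X = np.linalg.solve(L, S0)
    W = np.linalg.solve(L, X.T).T
    return np.linalg.eigvalsh((W + W.T) / 2)

rng = np.random.default_rng(4)
M, k1, ki, k0, sigma2 = 10, 2, 2, 3, 0.7   # illustrative sizes (assumed)
A_1only = rng.standard_normal((M, k1))     # columns indexed by S1 \ S0
A_both = rng.standard_normal((M, ki))      # columns indexed by S0 ∩ S1
A_0only = rng.standard_normal((M, k0))     # columns indexed by S0 \ S1

A_S0 = np.hstack([A_both, A_0only])
A_S1 = np.hstack([A_both, A_1only])
Sigma0 = sigma2 * np.eye(M) + A_S0 @ A_S0.T
Sigma1 = sigma2 * np.eye(M) + A_S1 @ A_S1.T

eig_H = generalized_eigs(Sigma0, Sigma1)   # same spectrum as H
top = eig_H[-k0:]                          # the k0 eigenvalues greater than 1
upper = 1.0 + np.linalg.eigvalsh(A_0only.T @ A_0only) / sigma2

# Non-zero eigenvalues of B B^T (M x M) and B^T B (k0 x k0) coincide.
nz_big = np.linalg.eigvalsh(A_0only @ A_0only.T)[-k0:]
nz_small = np.linalg.eigvalsh(A_0only.T @ A_0only)

print(np.all(top <= upper + 1e-9), np.allclose(nz_big, nz_small))
```

Only the sorted eigenvalues are compared, in line with the closing remark that the matrix inequality itself does not hold in general.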
Appendix C
Proof of Lemma Q]

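For arbitrary fixed supports $S_i, S_j$, we have

$$\bar{\lambda}_{S_i, S_j} \ge \frac{1}{2\kappa\sigma^2} \min_{1 \le l \le k_d} 2\kappa r_{ll}^2
\ge \frac{1}{2\kappa\sigma^2} \min_{1 \le l \le k_d} q_l,$$

where $2\kappa r_{ll}^2 \sim \chi^2_{2\kappa(M - K - l + 1)}$ can be written as a sum of
$2\kappa(M - K - l + 1)$ independent squared standard Gaussian
random variables and $q_l \sim \chi^2_{2\kappa(M - 2K)}$ is obtained by dropping
$2\kappa(K - l + 1)$ of them. Therefore, using the union bound we
obtain

$$
\begin{aligned}
P\left\{\bar{\lambda}_{S_i, S_j} \le \gamma\right\}
&\le P\left\{\frac{1}{2\kappa\sigma^2} \min_{1 \le l \le k_d} q_l \le \gamma\right\} \\
&\le P\left\{\bigcup_{1 \le l \le k_d} \left[q_l \le 2\kappa\sigma^2\gamma\right]\right\} \\
&\le k_d \, P\left\{q_l \le 2\kappa\sigma^2\gamma\right\}.
\end{aligned}
$$

Since $\gamma = \frac{1}{3} \frac{M - 2K}{\sigma^2}$ implies that
$2\kappa\sigma^2\gamma = \frac{2\kappa}{3}(M - 2K) < 2\kappa(M - 2K) - 2$, the mode of
$\chi^2_{2\kappa(M - 2K)}$, when $M - 2K$ is sufficiently large, we have

$$
\begin{aligned}
P\left\{q_l \le 2\kappa\sigma^2\gamma\right\}
&= \frac{1}{2^{\kappa(M - 2K)} \Gamma(\kappa(M - 2K))}
   \int_0^{2\kappa\sigma^2\gamma} x^{\kappa(M - 2K) - 1} e^{-x/2} \, dx \\
&\le \frac{(\kappa\sigma^2\gamma)^{\kappa(M - 2K)} e^{-\kappa\sigma^2\gamma}}{\Gamma(\kappa(M - 2K))}.
\end{aligned}
$$

The inequality $\log \Gamma(z) \ge (z - \frac{1}{2}) \log z - z$ says that when
$M - 2K$ is large enough,

$$
\begin{aligned}
P\left\{q_l \le 2\kappa\sigma^2\gamma\right\}
&\le \exp\Big\{\kappa(M - 2K) \log(\kappa\sigma^2\gamma) - \kappa\sigma^2\gamma \\
&\qquad\quad - \left[\kappa(M - 2K) - \tfrac{1}{2}\right] \log\left[\kappa(M - 2K)\right]
   + \kappa(M - 2K)\Big\} \\
&\le \exp\{-c(M - 2K)\},
\end{aligned}
$$

where $c < \kappa(\log 3 - 1)$. Therefore, we have

$$P\left\{\bar{\lambda}_{S_i, S_j} \le \gamma\right\} \le K \exp\{-c(M - 2K)\}. \quad ■$$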
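The chain of bounds on the chi-square lower tail can be checked with the standard library alone. The sketch below makes several assumptions of its own: it takes $\kappa = 1$ (read here as the complex case), writes $m = M - 2K$, uses the threshold $2\kappa\sigma^2\gamma = 2m/3$ from the proof, and evaluates the exact tail through the standard identity $P\{\chi^2_{2m} \le t\} = P\{\mathrm{Poisson}(t/2) \ge m\}$ for integer m. The function names and the value c = 0.098 (slightly below log 3 - 1) are illustrative.

```python
import math

def chi2_lower_tail_even(m, t, terms=400):
    """P{q <= t} for q ~ chi-square with 2m degrees of freedom (m a positive
    integer), via P{Poisson(t/2) >= m}; the tail is summed in the log domain
    and truncated after `terms` terms, so it slightly underestimates."""
    mu = t / 2.0
    return sum(math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1))
               for j in range(m, m + terms))

def stirling_bound(m, t):
    """exp{m log(t/2) - t/2 - (m - 1/2) log m + m}; this is the bound in the
    proof with kappa = 1, and it requires t <= 2m - 2 (t below the mode)."""
    half = t / 2.0
    return math.exp(m * math.log(half) - half - (m - 0.5) * math.log(m) + m)

c = 0.098                       # any c below log(3) - 1, roughly 0.0986
for m in (2, 5, 10, 20, 40):    # m plays the role of M - 2K
    t = 2.0 * m / 3.0           # threshold 2*kappa*sigma^2*gamma with kappa = 1
    exact = chi2_lower_tail_even(m, t)
    print(m, exact, stirling_bound(m, t), math.exp(-c * m))
```

Each printed row should be non-decreasing from the exact tail to the Stirling-type bound to $\exp\{-cm\}$, which is the ordering the proof relies on.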